Abstract


This dissertation addresses the problem of designing algorithms for learning in embedded
systems. This problem differs from the traditional supervised learning problem.
An agent, finding itself in a particular input situation, must generate an action.
It then receives a reinforcement value from the environment, indicating how
valuable  the  current  state  of  the  environment  is  for  the  agent.  The  agent  cannot, 
however,  deduce  the  reinforcement  value  that  would  have  resulted  from  executing 
any  of  its  other  actions.  A  number  of  algorithms  for  learning  action  strategies 
from  reinforcement  values  are  presented  and  compared  empirically  with  existing 
reinforcement-learning algorithms.

The interval-estimation algorithm uses the statistical notion of confidence intervals
to guide its generation of actions in the world, trading off acting to gain
information against acting to gain reinforcement. It performs well in simple domains
but does not exhibit any generalization and is computationally complex.

The cascade algorithm is a structural credit-assignment method that allows an
action strategy with many output bits to be learned by a collection of reinforcement-learning
modules that learn Boolean functions. This method represents an improvement
in computational complexity and often in learning rate.

Two algorithms for learning Boolean functions in k-DNF are described. Both
are based on Valiant’s algorithm for learning such functions from input-output instances.
The first uses Sutton’s techniques for linear association and reinforcement
comparison, while the second uses techniques from the interval-estimation algorithm.
They both perform well and have tractable complexity.




A  generate-and-test  reinforcement-learning  algorithm  is  presented.  It  allows 
symbolic  representations  of  Boolean  functions  to  be  constructed  incrementally  and 
tested in the environment. It is highly parametrized and can be tuned to learn
a  broad  range  of  function  classes.  Low-complexity  functions  can  be  learned  very 
efficiently  even  in  the  presence  of  large  numbers  of  irrelevant  input  bits.  This 
algorithm  is  extended  to  construct  simple  sequential  networks  using  a  set-reset 
operator,  which  allows  the  agent  to  learn  action  strategies  with  state. 

These  algorithms,  in  addition  to  being  studied  in  simulation,  were  implemented 
and  tested  on  a  physical  mobile  robot. 

Chapter 1
Introduction 


Embedded  systems,  such  as  autonomous  robots  and  process  controllers,  must  be 
able  to  learn  about  and  adapt  to  their  environments.  This  dissertation  addresses 
the  problem  of  designing  algorithms  for  learning  in  embedded  systems.  It  provides 
a  formal  framework  in  which  this  problem  can  be  explored,  discusses  previous  work 
in  this  area,  and  then  goes  on  to  present  novel  algorithms  for  efficient  and  effective 
learning  in  embedded  systems.  These  algorithms  are  explored  theoretically  and  are 
validated  empirically,  both  in  simulation  and  in  use  on  a  mobile  robot. 


1.1  Why  Learn? 

Why  should  we  build  learning  agents?  A  program  that  “learns”  is  not  intrinsically 
better  than  one  that  does  not. 

One  reason  to  build  learning  agents  is  that  it  is  very  difficult  for  humans  to  write 
explicit  programs  for  agents  that  must  work  in  complex,  uncertain  environments. 
In  programming  robots,  for  instance,  it  is  common  for  a  human  programmer  to 
learn  a  great  deal  about  the  operation  of  the  robot’s  sensors  and  effectors  in  the 
course of debugging programs for the robot. It would be much easier and less time-consuming
if the programmer were able to articulate only general principles about
the  environment,  allowing  the  robot  to  experiment  and  learn  about  the  details. 


Another  reason  for  building  agents  that  learn  to  act  is  that  we  would  like  to 
have  agents  that  are  flexible  enough  to  work  in  a  variety  of  environments,  adapting 
their  perception  and  action  strategies  to  the  worlds  in  which  they  find  themselves. 
Even  if  a  human  could  completely  specify  the  program  for  an  agent  operating  in 
a  particular  environment,  the  agent’s  program  would  have  to  be  respecified  if  the 
agent  were  moved  to  a  new  environment. 


1.2  Reinforcement  Learning 

When  building  learning  agents,  the  goal  of  the  agent’s  designer  is  for  the  agent 
to  learn  what  actions  it  should  perform  in  which  situations  in  order  to  maximize 
an  external  measure  of  success.  All  of  the  information  the  agent  has  about  the 
external  world  is  contained  in  a  series  of  inputs  that  it  receives  from  the  environment. 
These  inputs  may  encode  information  ranging  from  the  output  of  a  vision  system 
to  a  robot’s  current  battery  voltage.  The  agent  can  be  in  many  different  states 
of  information  about  the  environment,  and  it  must  map  each  of  these  information 
states,  or  situations,  to  a  particular  action  that  it  can  perform  in  the  world.  The 
agent’s  mapping  from  situations  to  actions  is  referred  to  as  an  action  map.  Part 
of the agent’s input from the world encodes the agent’s reinforcement, which is a
scalar  measure  of  how  well  the  agent  is  performing  in  the  world.  The  agent  should 
learn  to  act  in  such  a  way  as  to  maximize  the  total  reinforcement  it  gains  over  its 
lifetime. 

As a concrete example, consider a simple robot with two wheels and two photosensors.
It can execute five different actions: stop, go forward, go backward, turn
left,  and  turn  right.  It  can  sense  three  different  states  of  the  world:  the  light 
in  the  left  eye  is  brighter  than  that  in  the  right  eye,  the  light  in  the  right  eye  is 
brighter  than  that  in  the  left  eye,  and  the  light  in  both  eyes  is  roughly  equally  bright. 
Additionally,  the  robot  is  given  high  values  of  reinforcement  when  the  average  value 
of  light  in  the  two  eyes  is  increased  from  the  previous  instant.  In  order  to  maximize 
its  reinforcement,  this  robot  should  turn  left  when  the  light  in  its  left  eye  is  brighter, 
turn  right  when  the  light  in  its  right  eye  is  brighter,  and  move  forward  when  the 
light in both eyes is equal. The problem of learning to act is to discover such a
mapping  from  information  states  to  actions. 
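
The target mapping for this robot can be written down directly, which makes clear what the learner is searching for. The following is a minimal sketch; the situation and action names are illustrative labels chosen here, not identifiers from the dissertation.

```python
# Hypothetical encoding of the light-seeking robot's target action map.
# All names below are illustrative assumptions.
ACTIONS = ("stop", "forward", "backward", "left", "right")
SITUATIONS = ("left_brighter", "right_brighter", "equal")

# The mapping the robot should eventually discover by trial and error.
TARGET_ACTION_MAP = {
    "left_brighter": "left",
    "right_brighter": "right",
    "equal": "forward",
}

def act(situation: str) -> str:
    """Return the action that the target action map assigns to a situation."""
    return TARGET_ACTION_MAP[situation]

# Number of distinct state-free action maps the learner must choose among.
NUM_CANDIDATE_MAPS = len(ACTIONS) ** len(SITUATIONS)
```

Even in this tiny domain there are 5^3 = 125 candidate state-free maps, and the learner is never told which one is correct; it sees only reinforcement.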

Thus,  the  problem  of  learning  to  act  can  be  cast  as  a  function-learning  problem: 
the agent must learn a mapping from the situations in which it finds itself, represented
by streams of input values, to the actions it can perform. In the simplest
case,  the  mapping  will  be  a  pure  function  of  the  current  input  value,  but  in  general 
it  can  have  state,  allowing  the  action  taken  at  a  particular  time  to  depend  on  the 
entire  stream  of  previous  input  values. 

In the past few years there has been a great deal of work in the artificial intelligence
(AI) and theoretical computer science communities on the problem of
learning pure Boolean-valued functions [31,43,50,55,76]. Unfortunately, this work
is not directly relevant to the problem of learning action maps because of the different
settings of the problem. In the traditional function-learning work, often referred
to  in  the  AI  community  as  “concept  learning,”  a  learning  algorithm  is  presented 
with a set or series of input-output pairs that specify the correct output to be generated
for that particular input. This setting allows for effective function learning, but
differs from the situation of an agent trying to learn an action map. The agent, finding
itself in a particular input situation, must generate an action. It then receives
a  reinforcement  value  from  the  environment,  indicating  how  valuable  the  current 
world  state  is  for  the  agent.  The  agent  cannot,  however,  deduce  the  reinforcement 
value  that  would  have  resulted  from  executing  any  of  its  other  actions.  Also,  if  the 
environment  is  noisy,  as  it  will  be  in  general,  just  one  instance  of  performing  an 
action  in  a  situation  may  not  give  an  accurate  picture  of  the  reinforcement  value  of 
that  action. 

This  learning  scenario  reduces  to  concept  learning  when  the  agent  has  only  two 
possible  actions,  the  world  generates  Boolean  reinforcement  that  depends  only  on 
the  most  recently  taken  action,  there  is  exactly  one  action  that  generates  the  high 
reinforcement  value  in  each  situation,  and  there  is  no  noise.  In  this  case,  from 
performing  a  particular  action  in  a  situation,  the  agent  can  deduce  that  it  was  the 
correct  action  if  it  was  positively  reinforced;  otherwise  it  can  infer  that  the  other 
action  would  have  been  correct. 
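
The inference described in this special case can be sketched in a few lines. The sketch assumes exactly the conditions listed above (two actions, Boolean noise-free reinforcement, one correct action per situation); the function names and the encoding of actions as 0 and 1 are assumptions made for illustration.

```python
def inferred_correct_action(action: int, reinforcement: bool) -> int:
    """Deduce the correct action from a single trial.

    Assumes two actions encoded as 0 and 1, Boolean noise-free
    reinforcement, and exactly one correct action per situation.
    """
    # Positive reinforcement confirms the action; otherwise the other
    # action must have been the correct one.
    return action if reinforcement else 1 - action

def to_supervised_example(situation, action: int, reinforcement: bool):
    """Turn one reinforcement trial into an input-output training pair."""
    return (situation, inferred_correct_action(action, reinforcement))
```

When any of the assumptions fails (more than two actions, noise, or several acceptable actions), a single trial no longer identifies the correct output, and this conversion is not available.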

The problem of learning action maps by trial and error is often referred to
as reinforcement learning because of its similarity to models used in psychological
studies of behavior-learning in humans and animals [22]. It is also referred to as
“learning with a critic,” in contrast with the “learning with a teacher” of traditional
supervised concept learning [81]. One of the most interesting facets of the
reinforcement-learning  problem  is  the  tension  between  performing  actions  that  are 
not  well  understood  in  order  to  gain  information  about  their  reinforcement  value 
and  performing  actions  that  are  expected  to  be  good  in  order  to  increase  overall 
reinforcement.  If  an  agent  knows  that  a  particular  action  works  well  in  a  certain 
situation,  it  must  trade  off  performing  that  action  against  performing  another  one 
that  it  knows  nothing  about,  in  case  the  second  action  is  even  better  than  the  first. 
Or,  as  Ashby  [6]  put  it, 

The  process  of  trial  and  error  can  thus  be  viewed  from  two  very  different 
points  of  view.  On  the  one  hand  it  can  be  regarded  as  simply  an  attempt 
at success; so that when it fails we give zero marks for success. From this
point  of  view  it  is  merely  a  second-rate  way  of  getting  to  success.  There 
is,  however,  the  other  point  of  view  that  gives  it  an  altogether  higher 
status,  for  the  process  may  be  playing  the  invaluable  part  of  gathering 
information,  information  that  is  absolutely  necessary  if  adaptation  is  to 
be  successfully  achieved. 

The  longer  the  time  span  over  which  the  agent  will  be  acting,  the  more  important 
it  is  for  the  agent  to  be  acting  on  the  basis  of  correct  information.  Acting  to  gain 
information may improve the expected long-term performance while causing short-term
performance to decline.

Another important aspect of the reinforcement-learning problem is that the actions
that an agent performs influence the input situations in which it will find itself
in  the  future.  Rather  than  receiving  an  independently  chosen  set  of  input-output 
pairs,  the  agent  has  some  control  over  what  inputs  it  will  receive  and  complete 
control  over  what  outputs  will  be  generated  in  response.  In  addition  to  making 
it  difficult  to  make  distributional  statements  about  the  inputs  to  the  agent,  this 
degree of control makes it possible for what seem like small “experiments” to cause
the  agent  to  discover  an  entirely  new  part  of  its  environment. 


1.3  Models  versus  Action  Maps 

One  way  for  an  agent  to  learn  an  action  map  is  first  to  learn  a  state-transition  model 
of  the  world  and  the  expected  reinforcement  value  gained  from  being  in  each  world 
state,  and  then  to  apply  standard  dynamic  programming  techniques  to  choose  the 
best  action  from  any  given  world  state.  Although  this  method  will  work  in  the 
general  case,  the  internal  structures  that  the  agent  must  build  up  will  tend  to  be 
quite  complex. 

When  the  target  action-map  is  state-free,  it  can  be  represented  much  more 
compactly  and  executed  much  more  directly  as  a  simple  function,  rather  than  as 
a  world  model  with  a  procedure  for  choosing  the  optimal  action.  Sutton  [72]  and 
Whitehead  and  Ballard  [80]  have  found  that  in  cases  in  which  the  reinforcement 
from  the  world  is  delayed,  learning  may  be  sped  up  by  a  kind  of  compilation  from 
a  world  model.  However,  this  opens  up  the  new  problem  of  learning  world  models, 
which  has  been  addressed  by  a  number  of  people,  including  Sutton  and  Pinette  [73], 
Drescher  [18],  Mason,  Christiansen,  and  Mitchell  [40],  Mel  [42],  and  Shen  [68]. 

This  dissertation  will  focus  on  methods  for  learning  action  maps  without  using 
models. Even those methods that do use models have this simpler form of reinforcement
learning as a component, so improved algorithms for learning action maps will
benefit  both  approaches. 


1.4  Statistical  versus  Symbolic  Learning 

Most  previous  learning  work  can  be  divided  into  statistical  and  symbolic  methods. 

Statistical  learning  encompasses  much  of  the  early  learning  work  in  pattern 
recognition  [54]  and  adaptive  control  [25],  as  well  as  current  work  in  artificial  neural 
networks  (also  known  as  connectionist  systems)  [9].  The  internal  representations 
used  are  typically  numeric  and  the  correctness  of  algorithms  is  demonstrated  using 
statistical methods. These systems tend to be highly noise-tolerant and robust.
However,  the  internal  states  are  difficult  for  humans  to  interpret  and  the  algorithms 
often  perform  poorly  on  complex  problems. 

More symbolic approaches to learning, such as those standardly pursued in the
artificial intelligence community, attempt to address these issues of understandability
and complexity. They have resulted in algorithms, such as Mitchell’s version
spaces [49] and Michalski’s STAR [43], that use easily interpretable symbolic representations
and whose correctness hinges on arguments from logic rather than from
statistics. These algorithms tend to suffer from noise-intolerance and high computational
complexity, more so than statistical algorithms do.

One  of  the  aims  of  the  work  in  this  dissertation  is  to  blend  the  statistical  and 
the  symbolic  in  algorithms  for  reinforcement  learning  in  embedded  systems.  An 
important characteristic of most embedded systems is that they operate in environments
that are not (to them) completely predictable. In order to work effectively
in  such  environments,  a  system  must  be  able  to  summarize  general  tendencies  of 
its  environment.  The  well-understood  methods  of  statistics  are  most  appropriate 
for  this  task.  This  does  not,  however,  mean  we  must  abandon  all  of  the  benefits  of 
symbolic  AI  methods.  Rather,  these  two  approaches  can  be  synthesized  to  make 
learning systems that are robust and noise-tolerant as well as being easy to understand
and capable of working in complex environments. A good example of this
kind  of  synthesis  is  Quinlan’s  successful  concept-learning  method,  ID3  [55].  Within 
the  combined  approach,  complexity  issues  can  be  addressed  by  explicitly  considering 
limited  classes  of  functions  to  be  learned. 

Many  researchers  use  symbolic  representations  because,  as  Michie  [45]  puts  it, 
“In AI-type learning, explainability is all.” That is not the motivation for this
work, which simply seeks the most effective algorithms for building embedded systems.
There is, however, an important benefit of using symbolic representations of
concepts and strategies being learned by an agent: it may allow the learned knowledge
to be more easily integrated with knowledge that is provided by humans at
design  time.  Although  such  integration  is  not  explored  in  this  dissertation,  it  is  an 
important  direction  in  which  learning  research  should  be  pursued. 

The next chapter addresses the formal foundations of reinforcement learning. These
arise largely from previous work in statistics, dynamic programming, and learning-automata
theory. These foundations are important to AI because they allow widely
disparate  algorithms  to  be  compared  in  common,  objective  terms.  Chapter  3  goes 
on  to  present  previous  work  on  algorithms  for  reinforcement  learning  from  a  variety 
of  different  literatures.  This  previous  work  is  the  direct  basis  of  many  of  the  new 
algorithms  and  results  presented  in  this  dissertation. 

Chapter  4  describes  a  novel  statistical  algorithm  for  reinforcement  learning. 
It  empirically  shows  this  algorithm  to  be  more  effective  than  a  variety  of  other 
reinforcement-learning algorithms. Finally, it discusses weaknesses of this algorithm
and other related algorithms, due to high computational complexity and lack
of  generalization  across  input  instances. 

Chapter  5  describes  a  problem  reduction  and  an  algorithm  that  can  be  used  to 
implement  it.  The  problem  of  learning  an  action  map  with  many  output  bits  can 
be  reduced  to  the  problem  of  learning  many  action  maps,  each  with  a  single  output 
bit.  This  will  allow  us  to  restrict  our  attention  to  learning  action  maps  that  can 
be  described  as  Boolean  functions,  knowing  they  can  be  recombined  to  form  more 
complex  systems. 

Chapters 6 and 7 each present a novel algorithm for learning Boolean functions
from reinforcement; these algorithms represent points on a generality-efficiency
tradeoff. The algorithm in Chapter 6 is restricted to learning Boolean functions describable
as propositional formulae in the class k-DNF, but it learns these functions
more efficiently than the algorithms of Chapters 3 and 4. The algorithm in Chapter
7 is more flexible: according to the settings of internal parameters, it can be made
more or less restricted and, hence, more or less efficient.

All  of  the  discussion  up  to  this  point  has  been  of  environments  that  present 
reinforcement  immediately  and  of  action  maps  that  are  pure,  state-free  functions. 
Chapter  8  presents  an  extended  version  of  the  algorithm  of  Chapter  7  that  can  learn 
simple action maps with state. Chapter 9 addresses the problem of delayed reinforcement.
It presents two existing methods and shows how they may be combined
with  the  statistical  method  developed  in  Chapter  4. 

The  algorithms  presented  in  this  dissertation  are  finally  validated  through  their 
application  to  moderately  complex  domains,  including  a  real  mobile  robot.  Chapter 
10  describes  these  experiments,  documenting  their  successes  and  failures.  Finally, 
Chapter 11 summarizes the work presented in the previous chapters. It notes problems
and points out important directions for future research.

Chapter 2


Foundations 


This  chapter  focuses  on  building  formal  foundations  for  the  problem  of  learning  in 
embedded  systems.  These  foundations  must  allow  a  clear  statement  of  the  problem 
and provide a basis for evaluating and comparing learning algorithms. It is important
to establish such a basis: there are many instances in the machine learning
literature  of  researchers  doing  interesting  work  on  learning  systems,  but  reporting 
the  results  using  evaluation  metrics  that  make  it  difficult  to  compare  their  results 
with  the  results  of  others.  The  foundational  ideas  presented  in  this  chapter  are  a 
synthesis  of  previous  work  in  statistics  [12],  dynamic  programming  [57],  the  theory 
of  learning  automata  [53],  and  previous  work  on  the  foundations  of  reinforcement 
learning  [8,70,71,78,83,84]. 


2.1  Acting  in  a  Complex  World 

An embedded system, or agent, can be seen as acting in a world, continually executing
a procedure that maps the agent’s perceptual inputs to its effector outputs.
Its  world,  or  environment,  is  everything  that  is  outside  the  agent  itself,  possibly 
including  other  robotic  agents  or  humans.  The  agent  operates  in  a  cycle,  receiving 
an  input  from  the  world,  performing  some  computation,  then  generating  an  output 
that affects the world. The mapping that it uses may have state or memory, allowing
its action at any time to depend, potentially, on the entire stream of inputs that
it has received until that time. Such a mapping from an input stream to an output
stream  is  referred  to  as  a  behavior. 

In  order  to  study  the  effectiveness  of  particular  behaviors,  whether  or  not  they 
involve learning, we must model the connection between agent and world, understanding
how an agent’s actions affect its world and, hence, its own input stream.


2.1.1  Modeling  an  Agent’s  Interaction  with  the  World 

The world can be modeled as a deterministic finite automaton whose state transitions
depend on the actions of an agent [41]. From the agent’s perspective, the world
is everything that is not itself, including other agents and processes. This model
will be extended to include non-deterministic worlds in the next section. A world
can be formally modeled as the triple $(S, A, W)$, in which $S$ is the set of possible
states of the world, $A$ is the set of possible outputs from the agent to the world (or
actions that can be performed by the agent), and $W$ is the state transition function,
mapping $S \times A$ into $S$. Once the world has been fixed, the agent can be modeled as
the 4-tuple $(\mathcal{I}, I, R, B)$, where $\mathcal{I}$ is the set of possible inputs from the world to the
agent, $I$ is a mapping from $S$ to $\mathcal{I}$ that determines which input the agent will receive
when the world is in a given state, $R$ is the reinforcement function of the agent that
maps $S$ into real numbers (it may also be useful to consider more limited models
in which the output of the reinforcement function is Boolean-valued), and $B$ is the
behavior of the agent, mapping $\mathcal{I}^*$ (streams of inputs) into $A$. The expressions $i(t)$
and $a(t)$ will denote the input received by the agent at time $t$ and the action taken
by the agent at time $t$, respectively.

The process of an agent’s interaction with the world is depicted in Figure 1.
The world is in some internal state, $s$, which is projected into $i$ and $r$ by the input
and reinforcement functions $I$ and $R$. These values serve as inputs to the agent’s
behavior, $B$, which generates an action $a$ as output. Once per synchronous cycle
of this system, the value of $a$, together with the old value of world state $s$, is
transformed into a new value of world state $s$ by the world’s transition function $W$.

Figure 1: An agent’s interaction with its world.
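
The synchronous cycle can be sketched as a short simulation loop. This is only an illustration under stated assumptions: the function `run_cycle`, the toy four-cell world, and the choice to pass the behavior a stream of (input, reinforcement) pairs are all hypothetical and not taken from the dissertation.

```python
def run_cycle(W, I, R, B, s0, steps):
    """Simulate the agent-world cycle for a fixed number of ticks.

    W: transition function (state, action) -> state
    I: input function state -> input
    R: reinforcement function state -> number
    B: behavior, mapping the stream of (input, reinforcement) pairs
       seen so far to an action (an assumed encoding of the stream)
    Returns a list of (input, reinforcement, action) triples.
    """
    s = s0
    stream = []
    trace = []
    for _ in range(steps):
        i, r = I(s), R(s)          # project the world state into i and r
        stream.append((i, r))
        a = B(tuple(stream))       # behavior may depend on the whole stream
        trace.append((i, r, a))
        s = W(s, a)                # world moves to its next state
    return trace

# Toy world (hypothetical): four cells in a row, reinforcement in the last.
W = lambda s, a: max(0, min(3, s + a))
I = lambda s: s
R = lambda s: 1.0 if s == 3 else 0.0
always_right = lambda stream: 1    # a stimulus-free behavior

trace = run_cycle(W, I, R, always_right, s0=0, steps=4)
```

Here the behavior ignores its input stream entirely; a learning behavior would instead use the stream to change which action it emits over time.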


Note  that  if  the  agent  does  not  have  a  simple  stimulus-response  behavior,  but 
has  some  internal  state,  then  the  action  taken  by  the  behavior  can  be  a  function 
of  both  its  input  and  its  internal  state.  This  internal  state  may  allow  the  agent 
to discriminate among more states of the world and, hence, to obtain higher reinforcement
values by performing more appropriate actions. To simplify the following
discussion,  actions  will  be  conditioned  only  on  the  input,  but  the  treatment  can  be 
extended  to  the  case  in  which  the  action  depends  on  the  agent’s  internal  state  as 
well. 

2.1.2  Inconsistent  Worlds 

One  of  the  most  difficult  problems  that  a  learning  agent  must  contend  with  is 
apparent  inconsistency.  A  world  is  said  to  be  apparently  inconsistent  for  an  agent  if 
it  is  possible  that,  on  two  different  occasions  in  which  the  agent  receives  the  same 
input  and  generates  the  same  action,  the  next  states  of  the  world  differ  in  their 
reinforcement  or  the  world  changes  state  in  such  a  way  that  the  same  string  of 
future  actions  will  have  different  reinforcement  results. 




There are many different phenomena that can account for apparent inconsistency:

•  The  agent  does  not  have  the  ability  to  discriminate  among  all  world  states. 
If  the  agent’s  input  function  I  is  not  one-to-one,  which  will  be  the  case  in 
general,  then  an  individual  input  could  have  arisen  from  many  world  states. 
When  some  of  those  states  respond  differently  to  different  actions,  the  world 
will  appear  inconsistent  to  the  agent. 

•  The  agent  has  “faulty”  sensors.  Some  percentage  of  the  time,  the  world  is  in 
a  state  s,  which  should  cause  the  agent  to  receive  I(s)  as  input,  but  it  appears 
that the world is in some other state s', causing the agent to receive I(s') as
input  instead.  Along  with  the  probability  of  error,  the  nature  of  the  errors 
must  be  specified:  are  the  erroneously  perceived  states  chosen  maliciously, 
or  according  to  some  distribution  over  the  state  space,  or  contingently  upon 
what  was  to  have  been  the  correct  input? 

•  The  agent  has  “faulty”  effectors.  Some  percentage  of  the  time,  the  agent 
generates  action  a,  but  the  world  actually  changes  state  as  if  the  agent  had 
generated  a  different  action  a'.  As  above,  both  the  probability  and  nature  of 
the  errors  must  be  specified. 

•  The  world  has  a  probabilistic  transition  function.  In  this  case,  the  world  is  a 
stochastic  automaton  whose  transition  function,  W',  actually  maps  S  x  A  into 
a  probability  distribution  over  S  (a  mapping  from  S  into  the  interval  [0, 1]) 
that  describes  the  probability  that  each  of  the  states  in  S  will  be  the  next 
state  of  the  world. 
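
The first source of inconsistency in the list above, an input function that is not one-to-one, can be checked mechanically in a small finite world. The sketch below tests only the one-step part of the definition (differing reinforcement in the next state); the function name and the toy world are assumptions made for illustration.

```python
from itertools import combinations

def one_step_inconsistent(states, actions, W, I, R):
    """Return True if two states that look identical to the agent yield
    different reinforcement after the same action (one-step check only)."""
    for s1, s2 in combinations(states, 2):
        if I(s1) != I(s2):
            continue  # the agent can tell these states apart
        for a in actions:
            if R(W(s1, a)) != R(W(s2, a)):
                return True
    return False

# Toy world (hypothetical): "s1" and "s2" lead to different outcomes.
states = ["s1", "s2", "good", "bad"]
actions = ["go"]
W = lambda s, a: {"s1": "good", "s2": "bad"}.get(s, s)
R = lambda s: 1.0 if s == "good" else 0.0

aliased = lambda s: "x" if s in ("s1", "s2") else s   # s1 and s2 look alike
identity = lambda s: s                                  # one-to-one inputs
```

With the `aliased` input function the world appears inconsistent to the agent, while with `identity` the same deterministic world does not.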

Some  specific  cases  of  noise  phenomena  above  have  been  studied  in  the  formal 
function-learning  literature.  Valiant  [76]  has  explored  a  model  of  noise  in  which, 
with  some  small  probability,  the  entire  input  instance  to  the  agent  can  be  chosen 
maliciously.  This  corresponds,  roughly,  to  having  simultaneous  faults  in  sensing 
and action that can be chosen in a way that is maximally bad for the learning algorithm.
This model is overly pessimistic and is hard to justify in practical situations.




Angluin  [5]  works  with  a  model  of  noise  in  which  input  instances  are  misclassified 
with  some  probability;  that  is,  the  output  part  of  an  input-output  pair  is  specified 
incorrectly.  This  is  a  more  realistic  model  of  noise,  but  is  not  directly  applicable  to 
the  action-learning  problem  under  consideration  here. 

If the behavior of faulty sensors and effectors is not malicious, the inconsistency
they cause can be described by transforming the original world model into one in
which the set of world states, $S$, is identical to the set of agent inputs, $\mathcal{I}$, and
in which the world has a probabilistic transition function. Inconsistency due to
inability to discriminate among world states can also be modeled in this way, but
such a model is correct only for the one-step transition probabilities of the system.
Reducing each of these phenomena to probabilistic world-transition functions allows
the rest of the discussion of embedded behaviors to ignore the other possible modes
of inconsistency. The remainder of this section shows how to transform worlds with
each type of inconsistency into worlds with state set $\mathcal{I}$ and probabilistic transition
functions.

Consider  an  agent,  embedded  in  a  world  with  deterministic  transition  function 
W ,  whose  effectors  axe  faulty  with  probability  p,  so  that  when  the  intended  action  is 
a,  the  actual  action  is  i /(a).  This  agent’s  situation  can  be  described  by  a  probabilistic 
world  transition  function  W'(s,  a)  that  maps  the  value  of  W(s,  a)  to  the  probability 
value  1  —  Pi  the  value  of  W^(s,  v(a))  to  the  probability  value  p  and  all  other  states 
to  probability  value  0.  That  is, 

W\s,a)(W(s,a))  =  1  -p 

W'MiWMa))  =  p 

The result of performing action a in state s will be W(s, a) with probability 1 − p,
and W(s, ν(a)) with probability p. Figure 2 depicts this transition function.
First, a deterministic transition is made based on the action of the agent; then, a
probabilistic transition is made by the world. This model can be easily extended
to the case in which ν is a mapping from actions to probability distributions over
actions. For all a' not equal to a, the value of W(s, a') is mapped to the probability
value p ν(a)(a'), which is the probability, p, of an error times the probability that
action a' will be executed given that the agent intended to execute the action a.
The value of W(s, a) is mapped to the probability value 1 − p + p ν(a)(a), which
is the probability that there is no error, plus the probability that the error actually
maps back to the correct action.

Figure 2: Modeling faulty effectors as a probabilistic world transition function.
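
The faulty-effector construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `faulty_effector_model`, `ring`, and `reversing` are hypothetical, the four-state ring world is invented for the example, and ν is taken in its extended form as a map from an intended action to a distribution over actual actions.

```python
from collections import defaultdict


def faulty_effector_model(W, nu, p):
    """Build W'(s, a) -> {next_state: probability} from a deterministic W.

    W  : function (state, action) -> next state
    nu : function action -> {actual_action: probability}, used when an error occurs
    p  : probability that the effectors are faulty on a given step
    """
    def W_prime(s, a):
        dist = defaultdict(float)
        dist[W(s, a)] += 1.0 - p              # no error: the intended action runs
        for actual, q in nu(a).items():       # error: an action drawn from nu(a) runs
            dist[W(s, actual)] += p * q
        return dict(dist)
    return W_prime


# Hypothetical toy world: four states on a ring, actions step +1 or -1.
def ring(s, a):
    return (s + a) % 4


# Assumed error model: an error always reverses the intended step.
reversing = faulty_effector_model(ring, lambda a: {-a: 1.0}, p=0.1)
print(reversing(0, 1))
```

Because probabilities are accumulated per resulting state, the case in which the error maps back to the intended action adds p ν(a)(a) to the mass already assigned to W(s, a), matching the last sentence of the paragraph.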

Faulty input sensors are somewhat more difficult to model. Let the agent’s
sensors be faulty with probability p, yielding a value I(ν(s)) rather than I(s). We
can construct a new model with a probabilistic world-transition function in which
the states of the world are those that the agent thinks it is in. The model can be
most simply viewed if the world makes more than one probabilistic transition, as
shown in Figure 3. If it appears that the world is in state s, then with probability
p_s, it actually is, and the first transition is to the same state. The rest of the
probability mass is distributed over the other states in the inverse image of s under
ν, causing a transition to some world state s' with probability p_{s'}. Next,
there is a transition to a new state on the basis of the agent’s action according to
the original transition function W. Finally, with probability p, the world makes a
transition to the state ν(W(s', a)), allowing for the chance that this result will be
misperceived on the next tick. In Figure 4, the diagram of Figure 3 is converted
into a more standard form, in which the agent performs an action, and then the
world makes a probabilistic transition. This construction can also be extended to
the cases in which ν(s) is a probability distribution over S and in which the initial
world-transition function is probabilistic.
Figure 3: Modeling faulty sensors with multiple probabilistic transitions.

Figure 4: Modeling faulty sensors as a probabilistic world transition function.


We can construct an approximate model of an agent’s inability to discriminate
among world states by creating a new model of the world in which the elements
of ℐ are the states, standing for equivalence classes of the states in the old model.
Let {s_1, ..., s_n} be the inverse image of i under I. There is a probabilistic transition
to each of the s_j, based on the probability, p_j, that the world is in state s_j given
that the agent received the input i. From each of these states, the world makes
a transition on the basis of the agent’s action, a, to the state W(s_j, a), which is
finally mapped back down to the new state space by the function I. This process is
depicted in Figure 5 and the resulting transition function is shown in Figure 6. The
new transition function gives a correct one-step model of the transition probabilities,
but will not generate the same distribution of sequences of two or more states.
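
A minimal sketch of this aggregation, assuming the probabilities p_j are already available as a table, might look as follows. The names `aggregated_model` and `posterior`, and the four-state toy world, are hypothetical and chosen only for illustration.

```python
def aggregated_model(W, I, posterior):
    """Build a transition function over inputs from a deterministic world W.

    W         : function (state, action) -> next state
    I         : function state -> input seen by the agent
    posterior : {input: {state: probability that the world is in that state}}
                (assumed given; in general it depends on the agent's behavior)
    """
    def W_agg(i, a):
        dist = {}
        for s_j, p_j in posterior[i].items():
            i_next = I(W(s_j, a))               # act in s_j, then map back through I
            dist[i_next] = dist.get(i_next, 0.0) + p_j
        return dist
    return W_agg


# Hypothetical toy world: four ring states; the agent only sees which half it is in.
def ring(s, a):
    return (s + a) % 4


def half(s):
    return s // 2


posterior = {0: {0: 0.5, 1: 0.5}, 1: {2: 0.5, 3: 0.5}}
W_agg = aggregated_model(ring, half, posterior)
print(W_agg(0, 1))
```

The sketch only reproduces one-step probabilities; it makes no attempt to match the distribution over longer state sequences, which is the limitation noted in the text.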

Figure 5: Modeling inability to discriminate among worlds.

Figure 6: Modeling inability to discriminate among worlds as a probabilistic world
transition function.

In the construction for faulty sensors, it is necessary to evaluate the probability
that the world is in some state s*, given that it appears to the agent to be in another
state s. This probability depends on the unconditional probability that the world
is in the state s*, as well as the unconditional probability that the world appears
to be in the state s. These unconditional probabilities depend, in the general case,
on the behavior that the agent is executing, so the construction cannot be carried
out before the behavior is fixed. A similar problem exists for the case of lack of
discrimination: it is necessary to evaluate the probability that the world is in each
of the individual states in the inverse image of input i under I given that the agent
received input i. These probabilities also depend on the behavior that is being
executed by the agent. This leads to a very complex optimization problem that is,
in its general form, beyond the scope of this work.

This dissertation will mainly address learning in worlds that are globally consistent
for the learning agent. A world is globally consistent for an agent if and only if
for all inputs i ∈ ℐ and actions a ∈ A, the expected value of the reinforcement given
i and a is constant. Global consistency allows for variations in the result of performing
an action in a situation, as long as the expected, or average, result is the same.
It simply requires that there not be variations in the world that are undetectable by
the agent and that affect its choice of action. Important hidden state in the world
can cause such variations; methods for learning to act in such worlds are discussed
in Chapter 8. If the transformation described above has been carried out so that
the sets ℐ and S are the same, the requirement for global consistency is tantamount
to requiring that the resulting world be a Markov decision process with stationary
transition and output probabilities [35]. In addition, the following discussion will
assume that the world is consistent over changes in the agent’s behavior.


2.1.3  Learning  Behaviors 

The  problem  of  programming  an  agent  to  behave  correctly  in  a  world  is  to  choose 
some  behavior  B,  given  that  the  rest  of  the  parameters  of  the  agent  and  world  are 
fixed.  If  the  programmer  does  not  know  everything  about  the  world,  or  if  he  wishes 
the  agent  he  is  designing  to  be  able  to  operate  in  a  variety  of  different  worlds,  he 
must  program  an  agent  that  will  learn  to  behave  correctly.  That  is,  he  must  find 
a  behavior  B'  that,  through  changing  parts  of  its  internal  state  on  the  basis  of  its 
perceptual  stream,  eventually  converges  to  some  behavior  B"  that  is  appropriate  for 
the  world  that  gave  rise  to  its  perceptions.  Of  course,  to  say  that  a  program  learns 
is  just  to  take  a  particular  perspective  on  a  program  with  internal  state.  A  behavior 
with  state  can  be  seen  as  “learning”  if  parts  of  its  state  eventually  converge  to  some 
fixed or slowly-varying values. The behavior that results from those parameters
having been fixed in that way can be called the “learned behavior.”¹

A learning behavior is an algorithm that learns an appropriate behavior for an
agent in a world. It is itself a behavior, mapping elements of ℐ to elements of A,
but it requires the additional input r, which designates the reinforcement value of
the world state for the agent. A learning behavior consists of three parts: an initial
state s0, an update function u, and an evaluation function e.² At any moment, the
internal state, s, encodes whatever information the learner has chosen to save about
its interactions with the world. The update function maps an internal state of the
learner, an input, an action, and a reinforcement value into a new internal state,
adjusting the current state based on the reinforcement resulting from performing
that action in that input situation. The evaluation function maps an internal state
and an input into an action, choosing the action that seems most useful for the agent
in that situation, based on the information about the world stored in the internal
state. Recall that an action can be useful for an agent either because it has a high
reinforcement value or because the agent knows little about its outcome. Figure 7
shows a schematic view of the internal structure of a learning behavior. The register
s has initial value s0 and can be thought of as programming the evaluation function
e to act as a particular action map. The update function, u, updates the value of
s on each clock tick.

¹ In general, it is very difficult to formally differentiate between processes to which we would apply
the natural language term “perception” and those to which we would apply the term “learning.” In
common usage, “perception” tends to refer to gaining information that is specific, transient, or at a
low level of abstraction, whereas “learning” tends to refer to more general information that is true
over longer time spans. This issue is addressed in more detail in a paper comparing different views
of the nature of knowledge [34].

² From this point on, the variable s will refer to an internal state of the learning behavior. Because
we have assumed the transformations described in the previous section, it is no longer important to
name the different states of the world.

    s := s0
    loop
        i := input
        a := e(s,i)
        output a
        r := reinforcement
        s := u(s,i,a,r)
    end loop

Figure 8: General algorithm for learning behaviors.


A  general  algorithm  for  learning  behaviors,  based  on  these  three  components,  is 
shown  in  Figure  8.  The  internal  state  is  initialized  to  s0,  and  then  the  algorithm 
loops  forever.  An  input  is  read  from  the  world  and  the  evaluation  function  is 
applied  to  the  internal  state  and  the  input,  resulting  in  an  action,  which  is  then 
output.  At  this  point,  the  world  makes  a  transition  to  a  new  state.  The  program 
next  determines  the  reinforcement  associated  with  the  new  world  state,  uses  that 
information,  together  with  the  last  input  and  action,  to  update  the  internal  state, 
and  then  goes  back  to  the  top  of  its  loop.  Formulating  learning  behaviors  in  terms 
of  s0,  e,  and  u  facilitates  building  experimental  frameworks  that  allow  testing  of 
different  learning  behaviors  in  a  wide  variety  of  real  and  simulated  worlds. 
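
As a rough illustration of such an experimental framework, the loop of Figure 8 can be written as a Python function that takes s0, e, and u as parameters. Everything beyond those three components is an assumption made for the example: the `TwoActionWorld` class, its `input`/`output`/`reinforcement` methods, and the simple averaging learner are hypothetical stand-ins, not algorithms from this dissertation.

```python
def run_learning_behavior(s0, e, u, world, steps):
    """Run the general learning loop for a fixed number of ticks."""
    s = s0
    total = 0.0
    for _ in range(steps):
        i = world.input()             # read an input from the world
        a = e(s, i)                   # evaluation function chooses an action
        world.output(a)               # the world makes its transition
        r = world.reinforcement()     # reinforcement for the new world state
        s = u(s, i, a, r)             # update function revises the internal state
        total += r
    return s, total


class TwoActionWorld:
    """Hypothetical world with one input; action 1 is reinforced, action 0 is not."""
    def __init__(self):
        self.last_action = None

    def input(self):
        return 0

    def output(self, a):
        self.last_action = a

    def reinforcement(self):
        return 1.0 if self.last_action == 1 else 0.0


ACTIONS = (0, 1)
s0 = {}  # maps (input, action) -> (summed reinforcement, number of trials)


def e(s, i):
    """Try each untried action once, then pick the best observed average."""
    untried = [a for a in ACTIONS if (i, a) not in s]
    if untried:
        return untried[0]
    return max(ACTIONS, key=lambda a: s[(i, a)][0] / s[(i, a)][1])


def u(s, i, a, r):
    """Return a new state with the statistics for (i, a) updated."""
    total, count = s.get((i, a), (0.0, 0))
    new_s = dict(s)
    new_s[(i, a)] = (total + r, count + 1)
    return new_s


final_state, total = run_learning_behavior(s0, e, u, TwoActionWorld(), steps=10)
print(final_state, total)
```

Swapping in a different learner only requires supplying a different triple (s0, e, u); the loop and the world interface stay unchanged.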
2.2 Performance Criteria

In  order  to  compare  algorithms  for  learning  behaviors,  we  must  fix  the  criteria  on 
which  they  are  to  be  judged.  There  are  three  major  considerations:  correctness, 
convergence, and time-space complexity. First, we must determine the correct behavior
for an agent in a domain. Then we can measure to what degree a learned
behavior  approximates  the  correct  behavior  and  the  speed,  in  terms  of  the  number 
of  interactions  with  the  world,  with  which  it  converges.  We  must  also  be  concerned 
with  the  amount  of  time  and  space  needed  for  computing  the  update  and  evaluation 
functions  and  with  the  size  of  the  internal  state  of  the  algorithm. 

2.2.1  Correctness 

When  shall  we  say  that  a  behavior  is  correct  for  an  agent  in  an  environment? 
There  are  many  possible  answers  that  will  lead  to  different  learning  algorithms  and 
analyses.  An  important  quantity  is  the  expected  reinforcement  that  the  agent  will 
receive  in  the  next  instant,  given  that  the  current  input  is  i(t)  and  the  current  action 
is  a(t),  which  can  be  expressed  as 

$$\mathrm{er}(i(t), a(t)) = E\bigl(R(i(t+1)) \mid i(t), a(t)\bigr) = \sum_{i' \in \mathcal{I}} R(i')\, W'(i(t), a(t))(i').$$

It  is  the  sum,  over  all  possible  next  world  states,  of  the  probability  that  the  world 
will  make  a  transition  to  that  state  times  its  reinforcement  value.  This  formulation 
assumes  that  the  inputs  directly  correspond  to  the  states  of  the  world  and  that 
W'  is  a  probabilistic  transition  function.  If  the  world  is  globally  consistent  for  the 
agent,  the  process  is  Markov  and  the  times  are  irrelevant  in  the  above  definition, 
allowing  it  to  be  restated  as 

$$\mathrm{er}(i, a) = \sum_{i' \in \mathcal{I}} R(i')\, W'(i, a)(i').$$

One of the simplest criteria is that a behavior is correct if, at each step, it
performs the action that is expected to cause the highest reinforcement value to be
received on the next step. A correct behavior, in this case, is one that generates
actions that are optimal under the following definition:

$$\forall i \in \mathcal{I}, a \in \mathcal{A}.\ \mathrm{Opt}(i, a) \leftrightarrow \forall a' \in \mathcal{A}.\ \mathrm{er}(i, a) \ge \mathrm{er}(i, a').$$

Optimal behavior is defined as a relation on inputs and actions rather than as a
function, because there may be many actions that are equally good for a given
input. However, it can be made into a function by breaking ties arbitrarily. This
is a local criterion that may cause the agent to sacrifice future reinforcement for
immediately attainable current reinforcement.
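
The one-step quantities can be computed directly from a table of transition probabilities. The following sketch assumes W' is stored as a dictionary keyed by (input, action) and R as a dictionary keyed by input; the function names and the three-input toy world are hypothetical.

```python
def expected_reinforcement(W_prime, R, i, a):
    """er(i, a): sum over next inputs of transition probability times reinforcement."""
    return sum(prob * R[i_next] for i_next, prob in W_prime[(i, a)].items())


def optimal_actions(W_prime, R, i, actions):
    """Opt(i, .): the set of actions whose er(i, a) is at least that of every other."""
    values = {a: expected_reinforcement(W_prime, R, i, a) for a in actions}
    best = max(values.values())
    return {a for a, v in values.items() if v >= best}


# Hypothetical toy world: from "x", either action reaches a rewarding state half the time.
R = {"x": 0.0, "y": 1.0, "z": 1.0}
W_prime = {
    ("x", "L"): {"y": 0.5, "x": 0.5},
    ("x", "R"): {"z": 0.5, "x": 0.5},
}
print(expected_reinforcement(W_prime, R, "x", "L"))
print(optimal_actions(W_prime, R, "x", ("L", "R")))
```

Returning a set rather than a single action reflects the point that Opt is a relation: in this toy world both actions are optimal for input "x".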

The concept of expected reinforcement can be made more global by considering
the total expected reinforcement for a finite future interval, or horizon, given that
an action was taken in a particular input situation. This is often termed the value
of an action, and it is computed with respect to a particular behavior (because the
value of the next action taken depends crucially on how the agent will behave after
that). In the following, expected reinforcement is computed under the assumption
that the agent will act according to the optimal policy the rest of the time. The
expected reinforcement, with horizon k, of doing action a in input situation i at
time t is defined as

$$\mathrm{er}_k(i(t), a(t)) = E\Bigl(\sum_{j=1}^{k} R(i(t+j)) \Bigm| i(t), a(t), \forall h < k.\ \mathrm{Opt}_{k-h}(i(t+h), a(t+h))\Bigr).$$

This expression can be simplified to a recursive, time-independent formulation, in
which the k-step value of an action in a state is just the one-step value of the action
in the state plus the expected (k − 1)-step value of the optimal action for horizon k − 1
in the following state:

$$\mathrm{er}_k(i, a) = \mathrm{er}(i, a) + \sum_{i' \in \mathcal{I}} W'(i, a)(i')\, \mathrm{er}_{k-1}(i', \mathrm{Opt}_{k-1}(i')).$$

This definition is recursively dependent on the definition of optimality k steps into
the future, Opt_k:

$$\forall i \in \mathcal{I}, a \in \mathcal{A}.\ \mathrm{Opt}_k(i, a) \leftrightarrow \forall a' \in \mathcal{A}.\ \mathrm{er}_k(i, a) \ge \mathrm{er}_k(i, a').$$
The values of er_1 and Opt_1 are just er and Opt given above. The k-step value of
action a in situation i at time t, er_k(i, a), can be computed by dynamic programming
[12]. First, the Opt_1 relation is computed; this allows the er_2 function to be
calculated for all i and a. Proceeding for k steps will generate the value for er_k.
Because of the assumption that the world is Markov, these values are not dependent
on the time. However, if k is large, the computational expense of this method is
prohibitive.
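
The dynamic-programming computation can be sketched as follows, building er_k from er_{k-1} one horizon at a time and taking er_0 to be zero everywhere. The two-loop world used here is a hypothetical example constructed for the sketch; it is not the world of Figure 9, and the function and variable names are assumptions.

```python
def finite_horizon_values(W_prime, R, inputs, actions, k):
    """Return {(i, a): er_k(i, a)}, computed bottom-up from er_0 = 0."""
    prev = {(i, a): 0.0 for i in inputs for a in actions}
    for _ in range(k):
        # Value of acting optimally from each input at the previous horizon.
        best_prev = {i: max(prev[(i, a)] for a in actions) for i in inputs}
        # er_k(i, a) = sum_j W'(i, a)(j) * (R(j) + best er_{k-1} value at j)
        prev = {
            (i, a): sum(p * (R[j] + best_prev[j]) for j, p in W_prime[(i, a)].items())
            for i in inputs
            for a in actions
        }
    return prev


# Hypothetical world: a long loop with a late payoff and a short loop with a small one.
inputs = ("A", "L1", "L2", "R1")
actions = ("L", "R")
R = {"A": 0.0, "L1": 0.0, "L2": 5.0, "R1": 1.0}
forced = {"L1": "L2", "L2": "A", "R1": "A"}   # outside A the action has no effect
W_prime = {("A", "L"): {"L1": 1.0}, ("A", "R"): {"R1": 1.0}}
for state, nxt in forced.items():
    for act in actions:
        W_prime[(state, act)] = {nxt: 1.0}

for k in (1, 2, 3):
    values = finite_horizon_values(W_prime, R, inputs, actions, k)
    print(k, values[("A", "L")], values[("A", "R")])
```

Each pass costs time proportional to the number of (input, action, successor) triples, so the total cost grows linearly in k, which is the expense the text refers to for large horizons.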

Another way to define global optimality is to consider an infinite sum of future
reinforcement values in which near-term values are weighted more heavily than
values to be received in the distant future. This is referred to as a discounted
sum, depending on the parameter γ to specify the rate of discounting. Expected
discounted reinforcement at time t is defined as

$$\mathrm{er}_\gamma(i(t), a(t)) = E\Bigl(\sum_{j=1}^{\infty} \gamma^{j-1} R(i(t+j)) \Bigm| i(t), a(t), \forall h > 0.\ \mathrm{Opt}_\gamma(i(t+h), a(t+h))\Bigr).$$

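Properties of the exponential allow us to reduce this expression to

$$\mathrm{er}(i(t), a(t)) + \gamma\, \mathrm{er}_\gamma(i(t+1), a(t+1)),$$

which can be expressed independent of time as

$$\mathrm{er}_\gamma(i, a) = \mathrm{er}(i, a) + \gamma \sum_{i' \in \mathcal{I}} W'(i, a)(i')\, \mathrm{er}_\gamma(i', \mathrm{Opt}_\gamma(i')).$$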
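
The reduction can be checked by splitting the first term off the discounted sum. The step below is a sketch that assumes the weight on R(i(t + j)) is γ^{j−1}, so that the first reinforcement is undiscounted:

```latex
\sum_{j=1}^{\infty} \gamma^{j-1} R(i(t+j))
  = R(i(t+1)) + \gamma \sum_{j=1}^{\infty} \gamma^{j-1} R(i(t+1+j))
```

Taking the conditional expectation of both sides, the first term on the right is er(i(t), a(t)), and the remaining sum is the discounted value from time t + 1 onward under optimal actions, scaled by γ.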

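The related definition of γ-discounted optimality is given by

$$\forall i \in \mathcal{I}, a \in \mathcal{A}.\ \mathrm{Opt}_\gamma(i, a) \leftrightarrow \forall a' \in \mathcal{A}.\ \mathrm{er}_\gamma(i, a) \ge \mathrm{er}_\gamma(i, a').$$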

For a given value of γ and a proposed definition of Opt_γ, er_γ can be found by solving
a system of equations, one for each possible instantiation of its arguments. A dynamic
programming method called policy iteration [57] can be used in conjunction
with that solution method to adjust policy Opt_γ until it is truly the optimal behavior.
This definition of optimality is more widely used than finite-horizon optimality
because its exponential form makes it more computationally tractable. It is also an
intuitively satisfying model, with slowly diminishing importance attached to events
in the distant future.
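
A small sketch of policy iteration under these definitions is given below. It departs from the text in one labeled respect: instead of solving the system of equations exactly, the evaluation step approximates er_γ by repeated substitution for a fixed number of sweeps. The world is the same kind of hypothetical two-loop example used for illustration only, and all names are assumptions.

```python
def evaluate_policy(W_prime, R, inputs, actions, policy, gamma, sweeps=2000):
    """Approximate er_gamma(i, a) when `policy` is followed after the first action."""
    q = {(i, a): 0.0 for i in inputs for a in actions}
    for _ in range(sweeps):
        q = {
            (i, a): sum(p * (R[j] + gamma * q[(j, policy[j])])
                        for j, p in W_prime[(i, a)].items())
            for i in inputs
            for a in actions
        }
    return q


def policy_iteration(W_prime, R, inputs, actions, gamma, max_rounds=100):
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    policy = {i: actions[0] for i in inputs}
    q = evaluate_policy(W_prime, R, inputs, actions, policy, gamma)
    for _ in range(max_rounds):
        improved = {i: max(actions, key=lambda a: q[(i, a)]) for i in inputs}
        if improved == policy:
            break
        policy = improved
        q = evaluate_policy(W_prime, R, inputs, actions, policy, gamma)
    return policy, q


# Hypothetical world: a long loop with a late payoff and a short loop with a small one.
inputs = ("A", "L1", "L2", "R1")
actions = ("L", "R")
R = {"A": 0.0, "L1": 0.0, "L2": 5.0, "R1": 1.0}
forced = {"L1": "L2", "L2": "A", "R1": "A"}   # outside A the action has no effect
W_prime = {("A", "L"): {"L1": 1.0}, ("A", "R"): {"R1": 1.0}}
for state, nxt in forced.items():
    for act in actions:
        W_prime[(state, act)] = {nxt: 1.0}

for gamma in (0.1, 0.9):
    policy, q = policy_iteration(W_prime, R, inputs, actions, gamma)
    print(gamma, policy["A"])
```

In this toy world heavy discounting favors the short loop and light discounting favors the long one, so the choice at A changes with γ.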
Figure 9: A sample deterministic world. The numbers represent the immediate
reinforcement values that the agent will receive when it is in each of the states. The
only choice of action is in state A.

As  an  illustration  of  these  different  measures  of  optimality,  consider  the  world 
depicted  in  Figure  9.  In  state  A,  the  agent  has  a  choice  as  to  whether  to  go  right  or 
left;  in  all  other  states  the  world  transition  is  the  same  no  matter  what  the  agent 
does.  In  the  left  loop,  the  only  reinforcement  comes  at  the  last  state  before  state 
A,  but  it  has  value  6.  In  the  right  loop,  each  state  has  reinforcement  value  1.  Thus, 
the  average  reinforcement  is  higher  around  the  left  loop,  but  it  comes  sooner  around 
the  right  loop.  The  agent  must  decide  what  action  to  take  in  state  A.  Different 
definitions  of  optimality  lead  to  different  choices  of  optimal  action. 

Under  the  local  definition  of  optimality,  we  have  er(A,  L)  =  0  and  er(A,  R)  =  1. 
The  expected  return  of  going  left  is  0  and  of  going  right  is  1,  so  the  optimal  action 
would  be  to  go  right. 

Using the finite-horizon definition of optimality, which action is optimal depends
on the horizon. For very short horizons, it is clearly better to go right. When the
horizon, k, is 5, it becomes better to go left. A general rule for optimal behavior is
that when in state A, if the horizon is 5 or more, go left, otherwise go right. Figure
10 shows a plot of the values of going left (solid line) and going right (dashed line)
initially, assuming that all choices are made optimally thereafter. We can see that
going right is initially best, but it is dominated by going left for all k ≥ 5.
Figure 10: Plot of expected return against horizon k. Solid line indicates strategy
of going left first, then behaving optimally. Dashed line indicates strategy of going
right first, then behaving optimally.

Figure 11: Plot of expected return against discount factor γ. Solid line indicates
strategy of always going left. Dashed line indicates strategy of always going right.
Finally, we can consider discounted expected value. Figure 11 shows a plot of
the values of the strategies of always going left at state A (solid line) and always
going right at state A (dashed line) plotted as a function of γ. When there is a great
deal of discounting (γ is small), it is best to go right because the reward happens
sooner. As γ increases, going left becomes better, and at approximately γ = 0.915,
going left dominates going right.

Using a global optimality criterion can require agents to learn that chains of
actions will result in states with high reinforcement value. In such situations, the
agent takes actions not because they directly result in good states, but because they
result in states that are closer to the states with high payoff. One way to design
learning behaviors that attempt to achieve these difficult kinds of global optimality
is to divide the problem into two parts: transducing the global reinforcement signal
into a local reinforcement signal and learning to perform the locally best action.
The global reinforcement signal is the stream of values of R(i(t)) that come from
the environment. The optimal local reinforcement signal, R̂(i(t)), can be defined
as R(i(t)) + γ er_γ(i(t), Opt_γ(i(t))). It is the value of the state i(t) assuming that the
agent acts optimally. As shown by Sutton [70], this signal can be approximated
by the value of the state i(t) given that the agent follows the policy it is currently
executing. Sutton’s adaptive heuristic critic (AHC) algorithm, an instance of the
general class of temporal difference methods, provides a way of learning to generate
the local reinforcement signal from the global reinforcement signal in such a way
that, if combined with a correct local learning algorithm, it will converge to the
true optimal local reinforcement values [70,71]. A complication introduced by this
method is that, from the local behavior-learner’s point of view, the world is not
stationary. This is because it takes time for the AHC algorithm to converge and
because changes in the behavior cause changes in the values of states and therefore
in the local reinforcement function. This and related methods will be explored
further in Chapter 9.

The following discussion will be in terms of some definition of the optimality of
an action for a situation, Opt(i, a), which can be defined in any of the three ways
above, or in some novel way that is more appropriate for the domain in which a
particular agent is working.

2.2.2  Convergence 

Correctness  is  a  binary  criterion:  either  a  behavior  is  or  is  not  correct  for  its  world. 
Since  correctness  requires  that  the  behavior  perform  the  optimal  actions  from  the 
outset,  it  is  unlikely  that  any  “learning”  behavior  will  ever  be  correct.  Using  a 
definition  of  correctness  as  a  reference,  however,  it  is  possible  to  develop  other 
measures  of  how  close  particular  behaviors  come  to  the  optimal  behavior.  This 
section  will  consider  two  different  classes  of  methods  for  characterizing  how  good 
or  useful  a  behavior  is  in  terms  of  its  relation  to  the  optimal  behavior. 

Classical  Convergence  Measures 

Early  work  in  the  theory  of  machine  learning  was  largely  concerned  with  learning 
in  the  limit  [13,27].  A  behavior  converges  to  the  optimal  behavior  in  the  limit  if 
there  is  some  time  after  which  every  action  taken  by  the  behavior  is  the  same  as 
the  action  that  would  have  been  taken  by  the  optimal  behavior. 

Work  in  learning-automata  theory  has  relaxed  the  requirements  of  learning  in  the 
limit  by  applying  different  definitions  of  probabilistic  convergence  to  the  sequence  of 
internal  states  of  a  learning  automaton.  Following  Narendra  and  Thathachar  [53], 
the  definitions  are  presented  here.  A  learning  automaton  is  said  to  be  expedient  if 

$$\lim_{n \to \infty} E[M(n)] < M_0,$$

where M(n) is the average penalty (they are trying to minimize “penalty” rather
than maximize “reinforcement”; this is merely a terminological difference) for the
internal state at time step n and M_0 is M(n) for the pure-chance automaton that
selects each action randomly with a uniform distribution. A learning automaton is
said to be optimal if

$$\lim_{n \to \infty} E[M(n)] = c_l,$$

where c_l = min_i {c_i} and c_i is the expected penalty of executing action i. A learning
automaton is said to be ε-optimal if

$$\lim_{n \to \infty} E[M(n)] < c_l + \varepsilon$$

can be obtained for any arbitrary ε > 0 by a proper choice of the parameters of the
automaton. Finally, a learning automaton is said to be absolutely expedient if

$$E[M(n+1) \mid s(n)] < M(n)$$

for all legal internal states of the algorithm s(n) and for all possible sets {c_i} (i =
1, 2, ..., r) (under the assumption that environments with all expected penalties
equal are excluded).

An important recent theoretical development is a model of Boolean-function
learning algorithms that are probably approximately correct (PAC) [5,76], that is,
that have a high probability of converging to a function that closely approximates
the optimal function. The correctness of a function is measured with respect to a
fixed probability distribution on the input instances: a function is said to approximate
another function to degree ε if the probability that they will disagree on any
instance chosen according to the given probability distribution is less than ε. This
model requires that there be a fixed distribution over the input instances and that
each input to the algorithm be drawn according to that distribution.

For an agent to act effectively in the world, its inputs must provide some information
about the state that the world is in. In general, when the agent performs an
action it will bring about a change in the state of the world and, hence, a change in
the information the agent receives about the world. Thus, it will be very unlikely
that such an agent’s inputs could be modeled as being drawn from a fixed distribution,
making PAC-convergence an inappropriate model for autonomous agents.

In  addition,  the  PAC-learning  model  is  distribution-independent — it  seeks  to 
make  statements  about  the  performance  of  algorithms  no  matter  how  the  input 
instances  are  distributed.  As  Buntine  has  pointed  out  [14],  its  predictions  are  often 
overly  conservative  for  situations  in  which  there  is  a  priori  information  about  the 
distribution  of  the  input  instances,  or  even  in  which  certain  properties  of  the  actual 
sample,  such  as  how  many  distinct  elements  it  contains,  are  known. 
Measuring Error over an Agent’s Lifetime

None  of  the  classical  convergence  measures  take  into  account  the  behavior  of  the 
agent  during  the  period  in  which  it  converges.  Instead,  they  make  what  is,  for  an 
agent  embedded  in  the  world,  an  artificial  distinction  between  a  learning  phase  and 
an  acting  phase.  Autonomous  agents  that  have  extended  run  times  will  be  expected 
to  learn  for  their  entire  lifetime.  Because  they  may  not  encounter  certain  parts  or 
aspects  of  their  environments  until  arbitrarily  late  in  the  run,  it  is  inappropriate  to 
require  all  mistakes  to  be  made  before  some  fixed  deadline. 

Another  way  of  characterizing  the  performance  of  a  function-learning  algorithm 
is  to  count  the  divergences  it  makes  from  the  optimal  function.  Littlestone  [37]  has 
investigated  this  model  extensively,  characterizing  the  optimal  number  of  ‘mistakes’ 
for  a  Boolean-function  learner  and  presenting  algorithms  that  perform  very  well, 
under  this  measure,  on  certain  classes  of  Boolean  functions.  This  model  is  intuitively 
pleasing,  making  no  restrictive  division  into  learning  and  acting  phases,  but  it  is  not 
presented  as  being  suited  to  noisy  or  inconsistent  domains.  However,  by  assimilating 
the  inconsistency  of  the  domain  into  the  definition  of  the  target  function,  as  in  the 
requirement  for  optimal  behavior,  Opt,  we  can  make  use  of  mistake  bounds  in 
inconsistent  domains.  A  behavior  is  said  to  make  an  avoidable  mistake  if,  given 
some input instance i, it generates action a and Opt(i, a) does not hold; that is,
there  was  some  other  action  that  would  have  had  a  higher  expected  reinforcement. 

Avoidable mistake bounds take into account the fact that many mistakes cannot
be avoided by an agent with limited sensory abilities and unreliable effectors. However,
this measure is not entirely appropriate, because every non-optimal choice of
action is considered to be a mistake of the same magnitude. The expected error of
an action a given an input i, err(a, i), is defined to be

$$\mathrm{err}(a, i) = \mathrm{er}(i, a') - \mathrm{er}(i, a),$$

in which a' is any action such that Opt(i, a'). The expected error associated with
an optimal action is 0; for a non-optimal action, it is just the decrease in expected
reinforcement due to having executed that action rather than an optimal one. The
error of a behavior, either in the limit, or for runs of finite length, can be measured
by summing the errors of the actions it generates. This value, referred to in the
statistics literature as the regret of a strategy [12], represents the expected amount
of reinforcement lost due to executing this behavior rather than an optimal one.
This is an appropriate performance metric for agents embedded in inconsistent
environments because it measures expected loss of reinforcement, which is precisely
what we would like to minimize in our agents.
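
Given a table of one-step expected reinforcements, both the count of avoidable mistakes and the regret of a finite run are simple sums. The sketch below assumes such a table is available; the table values, the run `history`, and the function names are hypothetical.

```python
def expected_error(er, i, a, actions):
    """err(a, i): expected reinforcement of an optimal action minus that of a."""
    return max(er[(i, b)] for b in actions) - er[(i, a)]


def regret(er, history, actions):
    """Sum of expected errors over a run given as a list of (input, action) pairs."""
    return sum(expected_error(er, i, a, actions) for i, a in history)


def avoidable_mistakes(er, history, actions):
    """Number of steps on which the chosen action was not optimal."""
    return sum(1 for i, a in history if expected_error(er, i, a, actions) > 0)


# Hypothetical table of er(i, a) for two inputs and two actions.
actions = ("L", "R")
er = {("x", "L"): 0.5, ("x", "R"): 1.0, ("y", "L"): 0.25, ("y", "R"): 0.0}
history = [("x", "L"), ("x", "R"), ("y", "R")]

print(regret(er, history, actions))
print(avoidable_mistakes(er, history, actions))
```

The two measures differ on this run: both non-optimal steps count equally as mistakes, while regret weights the first (a loss of 0.5) more heavily than the last (a loss of 0.25).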

In  many  situations,  the  optimal  behavior  is  unknown  or  difficult  to  compute, 
which  makes  it  difficult  to  calculate  the  error  of  a  given  behavior.  It  is  still  possi¬ 
ble  to  use  this  measure  to  compare  two  different  behaviors  for  the  same  agent  and 
environment.  The  expected  reinforcement  for  an  algorithm  over  some  time  period 
can  be  estimated  by  running  it  several  times  and  averaging  the  resulting  total  rein¬ 
forcements.  Because  expectations  axe  additive,  the  difference  between  the  expected 
errors  of  two  algorithms  is  the  same  as  the  difference  between  their  expected  total 
reinforcement  values.  Thus,  the  difference  between  average  reinforcements  is  a  valid 
measure  of  a  behavior’s  correctness  that  is  independent  of  the  internal  architecture 
of  the  algorithm  and  that  can  be  used  to  compare  results  across  a  wide  variety  of 
techniques. 
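
The additivity argument can be written out in one line. As a sketch, let $R^*$ denote the expected total reinforcement of an optimal behavior over the period, and let $R_A$ and $R_B$ be the total reinforcements of two behaviors A and B. These symbols are introduced here only for illustration, and the step assumes that both behaviors face the same distribution of inputs, so that $R^*$ is the same for both.

```latex
\begin{align*}
E[\mathit{err}_A] - E[\mathit{err}_B]
  &= \bigl(R^* - E[R_A]\bigr) - \bigl(R^* - E[R_B]\bigr) \\
  &= E[R_B] - E[R_A].
\end{align*}
```

The unknown term $R^*$ cancels, so the comparison needs only the averaged reinforcement totals of the two behaviors.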

2.2.3  Time  and  Space  Complexity 

Autonomous  agents  must  operate  in  the  real  world,  continually  receiving  inputs 
from  and  performing  actions  on  their  environments.  Because  the  world  changes 
dynamically,  an  autonomous  agent  must  be  reactive — always  aware  of  and  reacting 
to changes in its environment. To ensure reactivity, an agent must operate in real
time; that is, its sense-compute-act cycle must keep pace with the unfolding of
important  events  in  the  environment.  The  exact  constraints  on  the  reaction  time  of 
an  agent  are  often  difficult  to  articulate,  but  it  is  clear  that,  in  general,  unbounded 
computation  must  never  take  place. 

A  convenient  way  to  guarantee  real-time  performance  is  to  require  that  the 
behavior  spend  only  a  constant  amount  of  time,  referred  to  as  a  ‘tick,’  generating 
an  action  in  response  to  each  input.  If  the  behavior  is  a  learning  behavior,  the 
learning  process  must  also  spend  only  a  constant  amount  of  time  on  each  input 
instance. There are two strategies for designing such a learning system: incremental
and  batch. 

An  incremental  system  processes  each  new  data  set  or  learning  instance  as  it 
arrives  as  input.  The  processing  must  be  efficient  enough  that  the  system  is  always 
ready  for  new  data  when  it  arrives.  If  new  relevant  data  can  arrive  every  tick, 
the  learning  algorithm  must  spend  only  one  constant  tick’s  worth  of  time  on  each 
instance.  The  requirement  for  incrementality  can,  theoretically,  be  relaxed  to  yield  a 
batch  system,  in  which  a  number  of  learning  instances  are  collected,  then  processed 
for  many  ticks.  As  long  as  the  learning  system  adheres  to  the  tick  discipline,  this 
process  need  not  interfere  with  the  reactiveness  of  the  rest  of  the  system.  Working 
in  batch  mode  may  limit  the  usefulness  of  the  learning  system  to  some  degree, 
however,  because  the  system  will  be  working  with  old  data  that  may  not  reflect  the 
current  situation  and  it  will  force  the  data  that  arrive  during  the  computation  phase 
to  be  ignored.  When  using  this  method,  the  input  data  must  be  sampled  with  care, 
in  order  to  avoid  statistical  distributions  of  inputs  that  do  not  reflect  those  of  the 
external  world. 

An algorithm can be said to be strictly incremental³ if it uses a bounded amount
of  time  and  space  throughout  its  entire  lifetime.  This  is  in  contrast  with  such 
approaches  as  Kibler  and  Aha’s  instance-based  learning  [1],  which  is  incremental 
in  that  it  processes  one  instance  at  a  time,  but  is  not  strictly  incremental  because 
instances are stored in a memory whose size may increase without bound. For an
incremental  system  that  processes  one  instance  per  tick  to  perform  in  real  time,  the 
system  must  be  strictly  incremental. 
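
The distinction can be illustrated with two toy estimators of average reinforcement. This is a sketch under stated assumptions: both classes and their names are hypothetical, and neither is taken from the instance-based learning work cited above. The first stores every instance, so its memory and per-query time grow with the length of the run; the second keeps a single number and does a constant amount of work per instance.

```python
class InstanceStore:
    """Incremental but not strictly incremental: memory grows with every instance."""

    def __init__(self):
        self.instances = []

    def update(self, x):
        self.instances.append(x)

    def estimate(self):
        # Time here is proportional to the number of stored instances.
        return sum(self.instances) / len(self.instances)


class DecayingAverage:
    """Strictly incremental: one float of state and constant work per instance."""

    def __init__(self, rate):
        self.rate = rate      # assumed step size, 0 < rate <= 1
        self.value = 0.0

    def update(self, x):
        self.value += self.rate * (x - self.value)

    def estimate(self):
        return self.value
```

Only the second estimator lets a designer bound the cost of a tick in advance, which is the property the tick discipline requires.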

By  definition,  the  amount  of  time  a  strictly  incremental  behavior  spends  on  each 
input  does  not  vary  as  a  function  of  the  number  of  inputs  that  have  been  received. 
It  will,  however,  depend  on  the  size  of  the  input  and  the  output,  but  that  is  fixed  at 
design  time.  This  allows  the  programmer  to  know  how  long  each  tick  of  the  learning 
behavior  will  take  to  compute  on  the  available  hardware  and  to  compare  that  rate 
with  the  pace  of  events  in  the  world. 


3 This terminology was suggested by R. Sutton.

Any formalization of the interaction between an agent and its world will depend
on  the  rate  of  the  interaction;  behaviors  that  work  at  different  rates  will  essentially 
be  working  in  different  environments.  The  expected  values  of  optimal  behaviors 
for  different  reaction  rates  will  be  quite  different.  In  general,  up  to  some  minimum 
value,  the  faster  an  agent  can  interact  with  the  world,  the  better  (otherwise  the 
agent  does  not  have  time  to  avert  impending  bad  events),  so  we  should  strive  for  the 
most  efficient  algorithms  possible,  though  a  slow  algorithm  with  better  convergence 
properties  might  be  preferable  to  a  fast  algorithm  that  is  far  from  optimal. 

Complex  agents,  such  as  mobile  robots  with  a  wide  variety  of  sensors  and  ef¬ 
fectors,  will  have  a  huge  number  of  possible  inputs  and  outputs.  If  algorithms  for 
these  agents  are  to  be  practical,  they  must  have  time  and  space  complexity  that 
is at worst polynomial in the number of input bits, lg(|I|), and the number of
output bits, lg(|A|), rather than the number of inputs and outputs. As we shall
see  in  Section  4.6,  this  will  only  be  achievable,  in  general,  by  limiting  the  class  of 
behaviors  that  can  be  learned  by  the  agent. 


2.3  Related  Foundational  Work 

The  problem  of  learning  the  structure  of  a  finite-state  automaton  from  examples 
has  been  studied  by  many  theoreticians,  including  Moore  [51],  Gold  [28]  and,  more 
recently,  Rivest  and  Schapire  [56].  This  is  a  very  difficult  problem  that  has  only 
been  studied  in  the  case  of  deterministic  automata.  If  the  entire  structure  of  the 
world  can  be  learned,  it  is  conceptually  straightforward  to  compute  the  optimal 
behavior.  It  is  important  to  note,  however,  that  learning  an  action-map  that  max¬ 
imizes  reinforcement  is  likely  to  be  much  less  complex  than  learning  the  world’s 
transition  function. 

Watkins  [78]  presents  a  clear  discussion  of  different  types  of  optimality  from  an 
operations-research  perspective  and  characterizes  possible  algorithms  for  learning 
optimal  behavior  from  delayed  rewards.  Sutton  [70,71]  shows  how  to  divide  the 
problem  of  learning  from  delayed  reinforcement  into  the  problems  of  locally  optimal 
behavior  learning  and  secondary  reinforcement-signal  learning.  The  implications 
of these ideas for learning from delayed reinforcement will be explored further in
Chapter  9. 

Williams has done important work on the foundations of reinforcement learning,
within a framework that differs considerably from the one provided in this chapter [83,84].
He  has  developed  a  general  form  for  expressing  reinforcement  algorithms  in  which 
a  wide  variety  of  existing  reinforcement  learning  algorithms  may  be  described.  In 
addition,  he  has  shown  that  the  algorithms  expressed  in  this  form  are  performing 
a  gradient  ascent  search,  in  which  the  average  update  of  the  internal  parameters  of 
the  algorithm  is  in  the  direction  of  steepest  ascent  for  expected  reinforcement. 


Chapter  3 

Previous  Approaches 


The  problem  of  learning  from  reinforcement  has  been  studied  by  a  variety  of  re¬ 
searchers:  statisticians  studying  the  “two-armed  bandit”  problem,  psychologists 
working  on  mathematical  learning  theory,  learning-automata  theorists,  and  AI  re¬ 
searchers.  This  chapter  explores  the  differing  frameworks  in  which  these  groups  have 
studied  reinforcement  learning  and  presents  a  few  important  algorithms  and  results 
from  each  area.  It  presents  previous  approaches  only  to  the  simple  reinforcement- 
learning  scenario  in  which  all  reinforcement  is  instantaneous  (the  goal  is  to  optimize 
local,  immediate  reinforcement)  and  the  action  maps  to  be  learned  are  pure  func¬ 
tions.  As  these  assumptions  are  relaxed,  later  in  the  dissertation,  other  relevant 
work  pertaining  to  the  more  complex  situations  will  be  discussed. 


3.1  Bandit  Problems 

The  reinforcement  learning  problem  is  addressed  within  the  statistics  community 
as  the  “two-armed  bandit”  problem:  given  a  machine  with  two  levers  that  pays 
some  amount  of  money  each  time  a  lever  is  pulled,  develop  a  strategy  that  gains  the 
maximum  payoff  over  time  by  choosing  which  lever  to  pull  based  on  the  previous 
experience  of  lever-pulling  and  payoffs.  Among  the  early  results  was  that  the  “stick 
with  a  winner  but  switch  on  a  loser”  strategy  is  expedient  (better  than  random), 
but  not  optimal  [12]. 
Algorithm 1 (BANDIT) The initial state, s0, consists of 3 components: c, an array
with two integer elements, and integers d and l. Initially, c contains zeros, d = -1,
and l = 0.

u(s,a,r) = if d = -1 then
               c[a] := c[a] + r
e(s) = if d = -1 then
           if c[0] - c[1] > k then begin
               d := 0; return 0; end
           else if c[1] - c[0] > k then begin
               d := 1; return 1; end
           else if l = 0 then begin
               l := 1; return 1; end
           else begin
               l := 0; return 0; end
       else return d


Figure  12:  Formal  description  of  the  BANDIT  algorithm. 

Most  of  the  technical  results  in  this  area  make  very  strict  assumptions  about  the 
a  priori  information  the  player  has  about  the  probabilistic  models  underlying  the 
payoff  processes  of  the  two  arms.  These  results  may  be  useful  in  restricted  situations, 
but  are  not  applicable  to  the  general  problem  of  building  learning  agents. 

There  has  been  some  consideration,  however,  of  the  minimax  case,  in  which  it 
is  assumed  that  the  events  of  arm-pulling  are  independent,  that  they  pay  off  either 
nothing  or  a  fixed  amount,  that  the  probability  of  each  arm  paying  off  remains 
constant  for  the  entire  game,  and  that  the  world  will  choose  the  probabilities  in 
the  way  that  is  worst  for  the  player.  It  has  been  shown  [12]  that  the  best  possible 
strategy for such a domain has regret proportional to $(1 - \gamma)^{-1/2}$ for discounting
factor $\gamma$ and to $n^{1/2}$ for finite horizon $n$.

An  example  algorithm  satisfying  these  requirements  is  formally  described  in 
Figure 12.¹ The algorithm alternates between the two arms, keeping track of the


1 There is no input argument, i, to the update and evaluation functions. This algorithm, as well
as  most  of  the  others  in  the  first  part  of  the  chapter,  makes  a  choice  about  what  action  to  perform 
number of successes of each. When the number of successes of one arm exceeds the
number of successes of the other by a number k, it chooses the winning arm forever
into the future. The array c contains counts of the number of successes of each arm;
d encodes the decision about future actions (if it has value -1, the decision has not
yet been made); l encodes the last action taken so that the algorithm can alternate
between actions in the pre-decision phase. If reinforcement is to be optimized over
a fixed horizon n, k should be chosen to be $n^{1/2}$. If reinforcement with discounting
factor $\gamma$ is to be optimized, k should be chosen to be $(1 - \gamma)^{-1/2}$. This is a simple
algorithm with an upper bound on regret of $(1 - \gamma)^{-1/2}(1 + \frac{1}{2e})$ in the discounted
case or $(1 - n^{-1})^{-(n-1)} n^{1/2} (1 + \frac{1}{2e})$ in the finite horizon case. This value is itself
bounded above by $n^{1/2}(e + 1/2)$. In both cases, the upper bound on regret is within
a  constant  factor  of  optimal.  However,  as  we  will  see  in  Section  4.4,  this  algorithm 
is  outperformed  by  many  others  in  empirical  tests. 
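
A direct transcription of the BANDIT algorithm into Python may make the roles of c, d, and l easier to follow. This is a minimal sketch, not a definitive implementation: it assumes reinforcement values are 0 or 1 and that the update adds r to the count for the chosen arm (so that c holds success counts, as the text describes). The class name `Bandit` and the helper `run` are introduced here for illustration.

```python
class Bandit:
    """Alternate between two arms until one leads by more than k successes."""

    def __init__(self, k):
        self.k = k
        self.c = [0, 0]   # success counts for arms 0 and 1
        self.d = -1       # committed arm, or -1 while undecided
        self.l = 0        # last arm pulled during the alternating phase

    def update(self, a, r):
        # Assumption: r is 0 or 1, so adding it counts successes.
        if self.d == -1:
            self.c[a] += r

    def evaluate(self):
        if self.d == -1:
            if self.c[0] - self.c[1] > self.k:
                self.d = 0
                return 0
            if self.c[1] - self.c[0] > self.k:
                self.d = 1
                return 1
            # Still undecided: pull the arm that was not pulled last time.
            self.l = 1 - self.l
            return self.l
        return self.d


def run(agent, payoff, steps):
    """Drive an agent for a number of ticks; payoff(a) returns the reinforcement."""
    actions = []
    for _ in range(steps):
        a = agent.evaluate()
        agent.update(a, payoff(a))
        actions.append(a)
    return actions
```

For a horizon of n steps the text suggests constructing the agent with k set to the square root of n, for example `Bandit(k=n ** 0.5)`.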


3.2  Learning  Automata 

Another  closely  related  field  is  that  of  learning  automata.  The  phrase  “learning 
automata”  means,  in  this  case,  automata  that  learn  to  act  in  the  world,  as  opposed 
to  automata  that  learn  the  state-transition  structures  of  other  automata  (as  in 
Moore  [51]). 

3.2.1  Early  Work 

The  first  work  in  this  area  took  place  in  the  Soviet  Union.  An  example  of  early 
learning-automaton  work  is  the  Tsetlin  automaton,  designed  by  M.  L.  Tsetlin  [75]. 
The  input  set  of  the  automaton  is  {0,1},  with  1  corresponding  to  the  case  when 
the  agent  receives  reinforcement  and  0  corresponding  to  the  case  when  it  does  not. 
As in the BANDIT algorithm, there is no input corresponding to i, the information
about  the  state  of  the  world.  The  automaton  has  two  possible  actions,  or  outputs: 
0  and  1.  The  operation  of  the  Tsetlin  automaton  is  described  in  Figure  13. 


for  every  future  time  step,  with  only  reinforcement  as  input. 
Algorithm 2 (TSETLIN)

[State-transition diagram, drawn once for each reinforcement value (the second copy is labeled r = 0). States 1, 2, 3, ..., N-1, N lie on the left half (a = 0) and states 2N, 2N-1, ..., N+3, N+2, N+1 lie on the right half (a = 1).]


The initial state can be any of the states, but would most reasonably be chosen to
be state N or state 2N. All of the states on the left half of the graph evaluate to
action  0  and  on  the  right  half  of  the  graph  to  action  1.  The  state  update  operation 
consists  of  making  one  of  the  labeled  transitions:  when  reinforcement  has  value  1, 
a  transition  to  the  left  is  taken  if  the  action  was  0  and  to  the  right  if  the  action  was 
1;  when  the  reinforcement  has  value  0,  a  right  transition  is  taken  if  the  action  was 
0  and  a  left  transition  if  the  action  was  1.  Zero  reinforcement  values  move  the  state 
toward  the  center  and  positive  reinforcement  values  move  the  state  toward  the  end 
corresponding  to  the  action  that  was  taken. 


Figure  13:  The  Tsetlin  automaton 
The Tsetlin automaton is parametrizable by the number, N, of states between
the  center  state  and  the  ends  of  the  chains  going  to  the  right  and  left.  It  can  be 
shown  that,  if  one  of  the  actions  has  success  probability  greater  than  .5,  then,  as  the 
value  N  approaches  infinity,  the  average  reinforcement  approaches  the  maximum 
success  probability  [53]. 
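
The transition rule of the Tsetlin automaton can be sketched as a walk along a line of 2N positions. The sketch below is an illustration under assumptions: positions are numbered left to right across the diagram, the two end states are assumed to stay in place when rewarded, and the class name `Tsetlin` and its `state` helper (which recovers the state label used in the figure) are hypothetical.

```python
class Tsetlin:
    """Two-action Tsetlin automaton with N states per action."""

    def __init__(self, n, start_action=0):
        self.n = n
        # Positions 0 .. 2n-1 run left to right; start next to the center.
        self.pos = n - 1 if start_action == 0 else n

    def state(self):
        """State label as in the figure: 1..N on the left, 2N..N+1 on the right."""
        return self.pos + 1 if self.pos < self.n else 3 * self.n - self.pos

    def evaluate(self):
        return 0 if self.pos < self.n else 1

    def update(self, a, r):
        toward_own_end = -1 if a == 0 else 1
        # Reward moves toward the end for the action taken; no reward moves
        # toward (and possibly across) the center.
        step = toward_own_end if r == 1 else -toward_own_end
        # Assumption: the end states absorb further rewarded steps.
        self.pos = min(max(self.pos + step, 0), 2 * self.n - 1)
```

A larger N makes the automaton slower to abandon an action after a run of failures, which is the behavior the limit result above relies on.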

There  are  many  other  similar  learning  automata,  some  with  better  convergence 
properties  than  this  one.  The  BANDIT  algorithm  can  also  be  easily  modeled  as  a 
finite-state  machine. 


3.2.2 Probability-Vector Approaches

As  it  is  difficult  to  conceive  of  complex  algorithms  in  terms  of  finite-state  transition 
diagrams,  the  learning  automata  community  moved  to  a  new  model,  in  which  the 
internal  state  of  the  learning  algorithm  is  a  vector  of  non-negative  numbers  that 
sum  to  1.  The  length  of  the  vector  corresponds  to  the  number  of  possible  actions  of 
the  agent.  The  agent  chooses  an  action  probabilistically,  with  the  probability  that 
it  chooses  the  nth  action  equal  to  the  nth  element  of  the  state  vector.  The  problem, 
then,  is  one  of  updating  the  values  in  the  state  vector  depending  on  the  most  recent 
action  and  its  outcome. 

These  and  similar,  related  models  were  also  independently  developed  by  the 
mathematical  psychology  community  [15]  as  models  for  human  and  animal  learning. 

The  most  common  of  these  approaches,  called  the  linear  reward-penalty  algo¬ 
rithm,  is  shown  in  Figure  14.  Whenever  an  action  is  chosen  and  succeeds,  the 
probability  of  performing  that  action  is  increased  in  proportion  to  1  minus  its  cur¬ 
rent  probability;  when  an  action  is  chosen  and  fails,  the  probability  of  performing 
the  other  action  is  increased  in  proportion  to  its  current  probability.  The  parame¬ 
ters  a  and  b  govern  the  amount  of  adjustment  upon  success  and  failure,  respectively. 
An  important  specialization  is  the  linear  reward-inaction  algorithm,  also  described 
in  Figure  14,  in  which  no  adjustment  is  made  to  the  probability  vector  when  rein¬ 
forcement  value  0  is  received. 
Algorithm 3 (Lrp) The initial state, s0, consists of p0 and p1, two positive real
numbers such that p0 + p1 = 1.

u(s,a,r) = if a = 0 then
               if r = 0 then
                   p0 := (1 - b)p0
               else p0 := p0 + a p1
           else
               if r = 0 then
                   p0 := p0 + b p1
               else p0 := (1 - a)p0
           p1 := 1 - p0

e(s) = 0 with probability p0
       1 with probability p1

Algorithm 4 (Lri) Any instance of Algorithm Lrp in which b = 0.

Figure  14:  The  linear  reward-penalty  (Lrp)  and  linear  reward-inaction  (Lri)  algo¬ 
rithms. 

The  linear  reward-penalty  algorithm  has  asymptotic  performance  that  is  better 
than  random  (that  is,  it  is  expedient),  but  it  is  not  optimal.  It  has  no  absorbing 
states,  so  it  always  executes  the  wrong  action  with  some  non-zero  probability.  The 
linear reward-inaction algorithm, on the other hand, has the absorbing states [1,0]
and  [0,1],  because  a  probability  is  only  ever  increased  if  the  corresponding  action 
is  taken  and  it  succeeds.  Once  one  of  the  probabilities  goes  to  0,  that  action  will 
never  be  taken,  so  its  probability  can  never  be  increased.  The  linear  reward-inaction 
algorithm is ε-optimal; that is, the parameter a can be chosen in order to make the
probability  of  converging  to  the  wrong  absorbing  state  as  small  as  desired.  As  the 
value  of  a  is  decreased,  the  probability  of  converging  to  the  wrong  state  is  decreased; 
however,  the  rate  of  convergence  is  also  decreased.  Theoreticians  have  been  unable 
to  derive  a  general  formula  that  describes  the  probability  of  convergence  to  the 
wrong state as a function of a and the initial value of p1. This would be necessary
in  order  to  choose  a  to  optimize  reinforcement  for  runs  of  a  certain  length  or  with 
a  certain  discounting  factor,  as  we  did  with  k  in  the  BANDIT  algorithm  above. 
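
The two linear schemes can be sketched as a single Python class. This is a hedged illustration rather than a reference implementation: the step sizes are passed as `a` and `b` as in the text, the action is named `action` to avoid clashing with the step size, and the names `LinearRewardPenalty` and `linear_reward_inaction` are introduced here. Reinforcement is assumed to be 0 or 1.

```python
import random


class LinearRewardPenalty:
    """Two-action linear reward-penalty scheme; the state is p0 (p1 is 1 - p0)."""

    def __init__(self, a, b, p0=0.5, rng=None):
        self.a = a            # step size on success
        self.b = b            # step size on failure
        self.p0 = p0
        self.rng = rng or random.Random()

    @property
    def p1(self):
        return 1.0 - self.p0

    def update(self, action, r):
        if action == 0:
            if r == 0:
                self.p0 = (1 - self.b) * self.p0
            else:
                self.p0 = self.p0 + self.a * self.p1
        else:
            if r == 0:
                self.p0 = self.p0 + self.b * self.p1
            else:
                self.p0 = (1 - self.a) * self.p0

    def evaluate(self):
        return 0 if self.rng.random() < self.p0 else 1


def linear_reward_inaction(a, p0=0.5, rng=None):
    """Linear reward-inaction is the special case with no adjustment on failure."""
    return LinearRewardPenalty(a, 0.0, p0, rng)
```

With b set to 0 a failure leaves the probabilities untouched, which is why the vectors [1,0] and [0,1] become absorbing.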
Algorithm 5 (TS) The initial state, s0, consists of the following 6 components: p0
and p1, which are positive real numbers such that p0 + p1 = 1, and R0 = R1 = Z0 =
Z1 = 0.

u(s,a,r) = d0 := R0/Z0; d1 := R1/Z1
           if a = 0 then begin
               if d0 > d1 then
                   p0 := p0 + λ(d0 - d1)p1
               else p0 := p0 + λ(d0 - d1)p0^2
               p1 := 1 - p0
               R0 := R0 + r
               Z0 := Z0 + 1
           end else begin
               if d1 > d0 then
                   p1 := p1 + λ(d1 - d0)p0
               else p1 := p1 + λ(d1 - d0)p1^2
               p0 := 1 - p1
               R1 := R1 + r
               Z1 := Z1 + 1
           end

e(s) = 0 with probability p0
       1 with probability p1

where 0 < λ < 1 is a positive constant.

Figure 15: The TS algorithm


In addition to these linear approaches, a wide range of non-linear approaches
have been proposed. One of the most promising is Thathachar and Sastry’s method
[74]. It is slightly divergent in form from the previous algorithms in that it keeps
more state than simply the vector p of action probabilities. In addition, there is
a vector d of estimates of the expected reinforcements of executing each action.
Reinforcement values are assumed to be real values in the interval [0,1]. A simple
two-action version of this algorithm is shown in Figure 15.

The Rj are the summed reinforcement values for each action, the Zj are the
number of times each action has been tried, and the dj are the average reinforcement
values for each action. The adjustment to the probability vector depends on the
values of the dj rather than on the direct results of recent actions. This introduces
a damping effect, because as long as, for instance, d0 > d1, p0 will be increased,
even if it has a few negative-reinforcement results during that time.

The TS algorithm converges much faster than the linear algorithms Lrp and
Lri. One of the reasons may be that it naturally takes big steps in the parameter
space when the actions are well differentiated (the difference between d0 and d1 is
large) and small steps when they are not. It has been shown that, for any stationary
random environment, there is some value of λ such that $p_l(n) \to 1$ in probability² as
$n \to \infty$, where $p_l(n)$ is the probability of executing the action that has the highest
expected reinforcement [74].
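
A two-action version of the TS update can be sketched as follows. Two points in this sketch are assumptions rather than facts from the text: the estimate for an action that has never been tried is taken to be 0 (the ratio R/Z is undefined when Z is 0), and the factor used when the chosen action has the lower estimate is taken to be the square of its own probability. The class name `TS` and the parameter name `lam` are hypothetical.

```python
import random


class TS:
    """Two-action estimator scheme in the style of Thathachar and Sastry."""

    def __init__(self, lam, p0=0.5, rng=None):
        self.lam = lam                 # step size, 0 < lam < 1
        self.p = [p0, 1.0 - p0]        # action probabilities
        self.R = [0.0, 0.0]            # summed reinforcement per action
        self.Z = [0, 0]                # number of trials per action
        self.rng = rng or random.Random()

    def _estimate(self, j):
        # Assumption: an untried action has estimate 0.
        return self.R[j] / self.Z[j] if self.Z[j] else 0.0

    def update(self, a, r):
        other = 1 - a
        # Estimates are computed before the new reinforcement is recorded.
        diff = self._estimate(a) - self._estimate(other)
        if diff > 0:
            self.p[a] += self.lam * diff * self.p[other]
        else:
            # Assumption: squared factor; it keeps p[a] within [0, 1].
            self.p[a] += self.lam * diff * self.p[a] ** 2
        self.p[other] = 1.0 - self.p[a]
        self.R[a] += r
        self.Z[a] += 1

    def evaluate(self):
        return 0 if self.rng.random() < self.p[0] else 1
```

The step taken is proportional to the gap between the two estimates, which is the source of the large steps in well-differentiated environments mentioned above.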


3.3 Reinforcement-Comparison Methods

One  drawback  of  most  of  the  algorithms  that  have  been  presented  so  far  is  that 
reinforcement  values  of  0  and  1  cause  the  same  sized  adjustment  to  the  internal 
state  independent  of  the  expected  reinforcement  value.  Sutton  [70]  addressed  this 
problem with a new class of algorithms, called reinforcement-comparison methods.
These  methods  work  by  estimating  the  expected  reinforcement,  then  adjusting  the 
internal  parameters  of  the  algorithm  proportional  to  the  difference  between  the 
actual  and  estimated  reinforcement  values.  Thus,  in  an  environment  that  tends  to 
generate  reinforcement  value  1  quite  frequently,  receiving  the  value  1  will  cause  less 
adjustment than will be caused by receiving the value 0.

An  instance  of  the  reward-comparison  method,  taken  from  Sutton’s  thesis  [70],  is 
shown  in  Figure  16.  The  internal  state  consists  of  the  “weight”  w,  which  is  initialized 
to  0,  and  the  predicted  expected  reinforcement,  p,  which  is  initialized  to  the  first 
reinforcement value received. The output, e(s), has value 1 or 0 depending on the
values of w and the random variable ν. The addition of the random value causes
the  algorithm  to  “experiment”  by  occasionally  performing  actions  that  it  would  not 

2 According to Narendra and Thathachar [53], “The sequence $\{X_n\}$ of random variables converges
in probability to the random variable $X$ if for every $\epsilon > 0$, $\lim_{n \to \infty} \Pr\{|X_n - X| > \epsilon\} = 0$.”
Algorithm 6 (RC) The internal state, s0, consists of the values w = 0 and p,
which will be initialized to the first reinforcement value received.

u(s,a,r) = w := w + α(r - p)(a - 1/2)
           p := p + β(r - p)
e(s) = 1 if w + ν > 0
       0 otherwise

where α > 0, 0 < β < 1, and ν is a normally distributed random variable of mean
0 and standard deviation $\delta_\nu$.

Figure  16:  A  reward-comparison  (rc)  algorithm. 

otherwise  have  taken.  The  state  component  w  is  incremented  by  a  value  with  three 
terms. The first term, α, is a constant that represents the learning rate. The next
term,  r  -  p,  represents  the  difference  between  the  actual  reinforcement  received  and 
the  predicted  reinforcement,  p.  This  serves  to  normalize  the  reinforcement  values: 
the  absolute  value  of  the  reinforcement  signal  is  not  as  important  as  its  value  relative 
to  the  average  reinforcement  that  the  agent  has  been  receiving.  The  third  term  in 
the update function for w is a - 1/2; it has constant absolute value and the sign
is  used  to  encode  which  action  was  taken.  The  predicted  reinforcement,  p,  is  a 
weighted  running  average  of  the  reinforcement  values  that  have  been  received. 
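
The reinforcement-comparison update can be sketched in Python as follows. This is an illustration under assumptions: the class name `ReinforcementComparison` and the parameter names `alpha`, `beta`, and `sigma` (the learning rate, the averaging rate, and the standard deviation of the noise) are introduced here, and the prediction p is initialized lazily on the first call to `update`.

```python
import random


class ReinforcementComparison:
    """Two-action learner that scales updates by (actual - predicted) reinforcement."""

    def __init__(self, alpha, beta, sigma, rng=None):
        self.alpha = alpha    # learning rate for the weight w
        self.beta = beta      # rate of the running average p
        self.sigma = sigma    # standard deviation of the exploration noise
        self.w = 0.0
        self.p = None         # set to the first reinforcement value received
        self.rng = rng or random.Random()

    def update(self, a, r):
        if self.p is None:
            self.p = r
        surprise = r - self.p
        # (a - 1/2) is +1/2 for action 1 and -1/2 for action 0.
        self.w += self.alpha * surprise * (a - 0.5)
        self.p += self.beta * surprise

    def evaluate(self):
        noise = self.rng.gauss(0.0, self.sigma)
        return 1 if self.w + noise > 0 else 0
```

When action 0 yields less than predicted, the product of the two negative terms is positive, so w moves toward choosing action 1.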

3.4  Associative  Methods 

The  algorithms  presented  so  far  have  addressed  the  case  of  reinforcement  learning  in 
environments  that  present  only  reinforcement  values  as  input  to  the  agent.  A  more 
general  setting  of  the  problem,  called  associative  reinforcement  learning ,  requires 
the  agent  to  learn  the  best  action  for  each  of  a  possibly  large  number  of  input 
states.  This  section  will  describe  three  general  approaches  for  converting  simple 
reinforcement-learning  algorithms  to  work  in  associative  environments.  The  first 
is  a  simple  copying  strategy,  and  the  second  two  are  instances  of  a  large  class  of 




Algorithm 7 (COPY) Let (s0, u, e) be a learning behavior that has only reinforcement
as input. We can construct a new learning behavior (s'0, u', e') with $2^M$ inputs
as follows:

s'0 = array of s0
u'(s', i, a, r) = u(s'[i], a, r)
e'(s', i) = e(s'[i])

Figure  17:  Constructing  an  associative  algorithm  by  making  copies  of  a  non- 
associative  algorithm. 

associative  reinforcement-learning  methods  developed  by  researchers  working  in  the 
connectionist  learning  paradigm.  Other  approaches  not  described  here  include  those 
of  Minsky  [48]  and  Widrow,  Gupta,  and  Maitra  [81].  Barto  [9]  gives  a  good  overview 
of  connectionist  learning  for  control,  including  learning  from  reinforcement. 


3.4.1  Copying 

The  simplest  method  for  constructing  an  associative  reinforcement-learner,  shown 
in  Figure  17,  consists  of  making  a  copy  of  the  state  of  the  no-input  version  of  the 
algorithm for each possible input and training each copy separately. It requires $2^M$
(the number of different input states) times the storage of the original algorithm.

In  addition  to  being  very  computationally  complex,  the  copying  method  does 
not  allow  for  any  generalization  between  input  instances:  that  is,  the  agent  cannot 
take  advantage  of  the  intuition  that  “similar”  situations  require  “similar”  responses. 
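
The copying construction can be sketched as a thin wrapper around any no-input learner that offers `update(a, r)` and `evaluate()`. The code below is a hedged illustration: inputs are assumed to be encoded as integers from 0 to 2^M - 1, and both the wrapper `Copy` and the stand-in learner `TallyLearner` are hypothetical names, the latter included only so that the example is self-contained.

```python
class TallyLearner:
    """Hypothetical stand-in for a no-input learner; not an algorithm from the text."""

    def __init__(self):
        self.totals = [0.0, 0.0]

    def update(self, a, r):
        self.totals[a] += r

    def evaluate(self):
        return 1 if self.totals[1] > self.totals[0] else 0


class Copy:
    """Associative learner built from one independent copy per input state."""

    def __init__(self, make_learner, num_input_bits):
        # Storage is 2**M times that of a single learner.
        self.copies = [make_learner() for _ in range(2 ** num_input_bits)]

    def update(self, i, a, r):
        self.copies[i].update(a, r)

    def evaluate(self, i):
        return self.copies[i].evaluate()
```

Because each copy is trained only on its own input, experience with input 3 in this sketch has no effect on the action chosen for input 0, which is the lack of generalization noted above.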


3.4.2  Linear  Associators 

In  his  thesis  [70],  Sutton  gives  methods  for  converting  standard  reinforcement- 
learning  algorithms  to  work  in  an  associative  setting  in  a  way  that  allows  an  agent 
to  learn  efficiently  and  to  generalize  across  input  states.  He  uses  a  version  of  the 
Widrow-Hoff  or  Adaline  [82]  weight-update  algorithm  to  associate  different  internal 
state  values  with  different  input  situations.  This  approach  is  illustrated  by  the  LARC 
Algorithm 8 (LARC) The input is represented as an M-dimensional vector i. The
internal state, s0, consists of two M-dimensional vectors, v and w.

u(s,i,a,r) = let p := v · i
             for j = 1 to M do begin
                 wj := wj + α(r - p)(a - 1/2)ij
                 vj := vj + β(r - p)ij
             end

e(s,i) = 1 if w · i + ν > 0
         0 otherwise

where α > 0, 0 < β < 1, and ν is a normally distributed random variable of mean
0 and standard deviation $\delta_\nu$.

Figure  18:  The  linear-associator  reinforcement-comparison  (larc)  algorithm. 


(linear-associator  reinforcement-comparison)  algorithm  shown  in  Figure  18.  It  is  an 
extension  of  the  RC  algorithm  to  work  in  environments  with  multiple  input  states. 


The inputs to the algorithm are represented as M-dimensional vectors. The output,
e(s, i), has value 1 or 0 depending on the inner product of the weight vector
w and i and the value of the random variable ν. The updating of the vector w is
somewhat complicated: each component is incremented by a value with four terms.
The first term, α, is a constant that represents the learning rate. The next term,
r - p, represents the difference between the actual reinforcement received and the
predicted reinforcement, p. The predicted reinforcement, p, is generated using a
standard linear associator that learns to associate input vectors with reinforcement
values by setting the weights in vector v. The third term in the update function
for w is a - 1/2: it has constant absolute value and the sign is used to encode
which action was taken. The final term is ij, which causes the jth component of
the weight vector to be adjusted in proportion to the jth value of the input.
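
The four-term update can be sketched in Python with plain lists standing in for the vectors. This is a minimal illustration, not a definitive implementation: the class name `LARC`, the helper `dot`, and the parameter names `alpha`, `beta`, and `sigma` are introduced here, and both weight vectors are assumed to start at zero.

```python
import random


def dot(x, y):
    return sum(xj * yj for xj, yj in zip(x, y))


class LARC:
    """Linear-associator reinforcement-comparison learner for M-dimensional inputs."""

    def __init__(self, m, alpha, beta, sigma, rng=None):
        self.alpha = alpha
        self.beta = beta
        self.sigma = sigma
        self.v = [0.0] * m    # weights predicting reinforcement from the input
        self.w = [0.0] * m    # weights selecting the action
        self.rng = rng or random.Random()

    def update(self, i, a, r):
        p = dot(self.v, i)    # predicted reinforcement for this input
        for j in range(len(i)):
            self.w[j] += self.alpha * (r - p) * (a - 0.5) * i[j]
            self.v[j] += self.beta * (r - p) * i[j]

    def evaluate(self, i):
        noise = self.rng.gauss(0.0, self.sigma)
        return 1 if dot(self.w, i) + noise > 0 else 0
```

The state holds 2M numbers regardless of how many distinct input vectors occur, in contrast with the per-input copies of the previous construction.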
Another instance of the linear-associator approach is Barto and Anandan’s
associative reward-penalty (Arp) algorithm [7]. It is a hybrid of the linear
reward-penalty and linear-associator methods and was shown (under a number of
restrictions, including the restriction that the set of input vectors be linearly independent)
to be ε-optimal.

The linear-associator approach can be applied to any of the learning algorithms
whose internal state consists of one or a small number of independently interpretable
numbers for each input. If the input set is encoded by bit strings, the
linear-associator approach can achieve an exponential improvement in space over the copy
approach, because the size of the state of the linear-associator is proportional to the
number of input bits rather than to the number of inputs. This algorithm works well
on simple problems, but algorithms of this type are incapable of learning functions
that are not linearly separable [47].

3.4.3  Error  Backpropagation 

To  remedy  the  limitations  of  the  linear-associator  approach,  multi-layer  connection- 
ist  learning  methods  have  been  adapted  to  reinforcement  learning.  Anderson  [3], 
Werbos  [79],  and  Munro  [52],  among  others,  have  used  error  back-propagation 
methods³ with hidden units in order to allow reinforcement-learning systems to
learn  more  complex  action  mappings.  Williams  [85]  presents  an  analysis  of  the  use 
of  backpropagation  in  associative  reinforcement-learning  systems.  He  shows  that  a 
class  of  reinforcement-learning  algorithms  that  use  back-propagation  (an  instance 
of  which  is  given  below)  perform  gradient  ascent  search  in  the  direction  of  maximal 
expected  reinforcement.  This  technique  is  effective  and  allows  considerably  more 
generalization  across  input  states,  but  it  requires  many  more  presentations  of  the 
data  in  order  for  the  internal  units  to  converge  to  the  features  that  they  need  to 
detect  in  order  to  compute  the  overall  function  correctly.  Barto  and  Jordan  [10] 
demonstrate  the  use  of  a  multi-layer  version  of  the  associative  reward-penalty  algo¬ 
rithm  to  learn  non-linear  functions.  This  method  is  argued  to  be  more  biologically 

3 A good description of error back-propagation for supervised learning is given by Rumelhart,
Hinton,  and  Williams  [58]. 
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plausible  than  back-propagation,  but  requires  considerably  more  presentations  of 
the  data. 

As an example of the application of error backpropagation methods to reinforcement
learning, Anderson’s method [3] will be examined in more detail. It uses
two  networks:  one  for  learning  to  predict  reinforcement  and  one  for  learning  which 
action  to  take.  The  weights  in  the  action  network  are  updated  in  proportion  to 
the  difference  between  actual  and  predicted  reinforcement,  making  this  an  instance 
of  the  reinforcement-comparison  method  (discussed  in  Section  3.3  above).  Each  of 
the  networks  has  two  layers,  with  all  of  the  hidden  units  connected  to  all  of  the 
inputs  and  all  of  the  inputs  and  hidden  units  connected  to  the  outputs.  The  system 
was  designed  to  work  in  worlds  with  delayed  reinforcement  (which  are  discussed 
at  greater  length  in  Chapter  9),  but  it  is  easily  simplified  to  work  in  our  simpler 
domain. 

The  BP  algorithm  is  shown  in  Figures  19  and  20  and  is  explained  in  detail  by 
Anderson  [3].  The  presentation  here  is  simplified  in  a  number  of  respects,  however. 
In this version, there is no use of momentum and the term (a − 1/2) is used to
indicate the choice of action rather than the more complex expression used by
Anderson. Also, Anderson uses a different distribution for the random variable ν.

This  method  is  theoretically  able  to  learn  very  complex  functions,  but  tends  to 
require  many  training  instances  before  it  converges.  The  time  and  space  complexity 
for  this  algorithm  is  O(MH),  where  M  is  the  number  of  input  bits  and  H  is  the 
number  of  hidden  units.  Also,  this  method  is  somewhat  less  robust  than  the  more 
standard  version  of  error  back-propagation  that  learns  from  I/O  pairs,  because  the 
error  signal  generated  by  the  reinforcement-learning  system  is  not  always  correct. 


3.5  Genetic  Algorithms 

Genetic  algorithms  constitute  a  considerably  different  approach  to  the  design  and 
implementation  of  reinforcement-learning  systems.  This  section  will  briefly  describe 
the  general  approach  and  point  to  some  representative  applications  of  these  methods 
Algorithm 9 (BP) The input is represented as an (M + 1)-dimensional vector i, in
which the last element contains a constant value. The internal state, s_0, consists of

W_EH : Weights of the hidden units in the evaluation network, an H by M + 1
element array initialized to small random values.

W_EO : Weights of the output unit in the evaluation network, an H + M + 1 element
array initialized to small random values.

W_AH : Weights of the hidden units in the action network, an H by M + 1 element
array initialized to small random values.

W_AO : Weights of the output unit in the action network, an H + M + 1 element
array initialized to small random values.

In addition, the algorithm makes use of the following local variables:

O_EH : Outputs of the hidden units in the evaluation network, an H element array.

O_AH : Outputs of the hidden units in the action network, an H element array.

p : Output of the output unit in the evaluation network.


Figure 19: An application of error backpropagation to reinforcement learning: data
structures.




Algorithm 9 (BP) (continued)

u(s, i, a, r) = for j = 1 to H do
                    O_EH[j] := f(i · W_EH[j])
                p := W_EO · concat(i, O_EH)
                for j = 1 to M + 1 do
                    W_EO[j] := W_EO[j] + β (r − p) i[j]
                for j = 1 to H do
                    W_EO[j + M + 1] := W_EO[j + M + 1] + β (r − p) O_EH[j]
                for j = 1 to H do begin
                    d := (r − p) sign(W_EO[j + M + 1]) O_EH[j] (1 − O_EH[j])
                    for k = 1 to M + 1 do
                        W_EH[j, k] := W_EH[j, k] + β_h d i[k]
                end
                for j = 1 to M + 1 do
                    W_AO[j] := W_AO[j] + ρ (r − p) (a − 1/2) i[j]
                for j = 1 to H do
                    W_AO[j + M + 1] := W_AO[j + M + 1] + ρ (r − p) (a − 1/2) O_AH[j]
                for j = 1 to H do begin
                    d := (r − p) (a − 1/2) sign(W_AO[j + M + 1]) O_AH[j] (1 − O_AH[j])
                    for k = 1 to M + 1 do
                        W_AH[j, k] := W_AH[j, k] + ρ_h d i[k]
                end

e(s, i) = for j = 1 to H do
              O_AH[j] := f(i · W_AH[j])
          return 1 if (W_AO · concat(i, O_AH)) + ν > 0
                 0 otherwise

where β, β_h, ρ, ρ_h > 0, f(x) = 1/(1 + e^(−x)), and ν is a normally distributed random
variable of mean 0 and standard deviation δ_ν.

Figure 20: An application of error backpropagation to reinforcement learning: update
and evaluation functions.
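
As a rough illustration of Figures 19 and 20, the following Python sketch implements the update and evaluation functions with plain lists. It is a sketch under stated assumptions rather than Anderson's implementation: the class name `BP`, the initial weight range of ±0.1, the learning-rate defaults, and the noise standard deviation are all illustrative choices, and the action-network hidden outputs are recomputed inside `update` instead of being carried over from the preceding call to the evaluation function.

```python
import math
import random


def f(x):
    """Logistic function f(x) = 1 / (1 + e^-x) used by the hidden units."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sign(x):
    return (x > 0) - (x < 0)


class BP:
    """Evaluation network predicts reinforcement; action network picks 0 or 1."""

    def __init__(self, M, H, beta=0.1, beta_h=0.1, rho=0.1, rho_h=0.1,
                 noise_sd=0.1, seed=0):
        self.M, self.H = M, H
        self.beta, self.beta_h, self.rho, self.rho_h = beta, beta_h, rho, rho_h
        self.noise_sd = noise_sd          # assumed value for the noise term
        self.rng = random.Random(seed)
        small = lambda n: [self.rng.uniform(-0.1, 0.1) for _ in range(n)]
        self.w_eh = [small(M + 1) for _ in range(H)]   # H by (M + 1)
        self.w_eo = small(M + 1 + H)                   # inputs, then hidden units
        self.w_ah = [small(M + 1) for _ in range(H)]
        self.w_ao = small(M + 1 + H)

    def predict(self, i):
        """Predicted reinforcement p for input list i (last element constant)."""
        o_eh = [f(dot(i, w)) for w in self.w_eh]
        return dot(self.w_eo, i + o_eh)

    def evaluate(self, i):
        """Evaluation function e: thresholded action-network output plus noise."""
        o_ah = [f(dot(i, w)) for w in self.w_ah]
        v = self.rng.gauss(0.0, self.noise_sd)
        return 1 if dot(self.w_ao, i + o_ah) + v > 0 else 0

    def update(self, i, a, r):
        """Update function u: both networks are driven by the error r - p."""
        o_eh = [f(dot(i, w)) for w in self.w_eh]
        o_ah = [f(dot(i, w)) for w in self.w_ah]
        err = r - dot(self.w_eo, i + o_eh)
        self._train(self.w_eo, self.w_eh, o_eh, i, self.beta, self.beta_h, err)
        self._train(self.w_ao, self.w_ah, o_ah, i, self.rho, self.rho_h,
                    err * (a - 0.5))

    def _train(self, w_out, w_hid, o_hid, i, rate, rate_h, signal):
        n_in = self.M + 1
        for j in range(n_in):                      # output weights on the inputs
            w_out[j] += rate * signal * i[j]
        for j in range(self.H):                    # output weights on hidden units
            w_out[n_in + j] += rate * signal * o_hid[j]
        for j in range(self.H):                    # hidden-unit weights
            d = signal * sign(w_out[n_in + j]) * o_hid[j] * (1.0 - o_hid[j])
            for k in range(n_in):
                w_hid[j][k] += rate_h * d * i[k]
```

A learner would be used by calling `evaluate(i)` to choose an action, executing it, and then calling `update(i, a, r)` with the reinforcement received. Because the two halves of the update differ only in their learning rates and error signal, the sketch shares one helper between them.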




to  reinforcement  learning.  An  excellent  introduction  to  and  survey  of  this  field  is 
given  in  Goldberg’s  book  [29]. 

In  their  purest  form,  genetic  algorithms  (GA’s)  can  be  seen  as  a  technique  for 
solving  optimization  problems  in  which  the  elements  of  the  solution  space  are  coded 
as binary strings and in which there is a scalar objective function that can be used to
compute  the  “fitness”  of  the  solution  represented  by  any  string.  The  GA  maintains 
a  “population”  of  strings,  which  are  initially  chosen  randomly.  The  fitness  of  each 
member  of  the  population  is  calculated.  Those  with  low  fitness  values  are  eliminated 
and  members  with  high  fitness  values  are  reproduced  in  order  to  keep  the  population 
at  a  constant  size.  After  the  reproduction  phase,  operators  are  applied  to  introduce 
variation  in  the  population.  Common  operators  are  crossover  and  mutation.  In 
crossover,  two  population  elements  are  chosen,  at  random,  as  operands.  They  are 
recombined  by  randomly  choosing  an  index  into  the  string  and  making  two  new 
strings,  one  that  consists  of  the  first  part  of  the  first  string  and  the  second  part  of 
the  second  string  and  one  that  consists  of  the  first  part  of  the  second  string  and  the 
second  part  of  the  first  string.  Mutation  simply  changes  bits  in  population  elements, 
with  very  low  probability. 
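
The reproduction, crossover, and mutation steps described above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the choice to keep the fitter half of the population and copy it to restore the original size is an assumption, since the text does not fix a particular selection scheme.

```python
import random


def crossover(a, b, rng):
    """Single-point crossover of two equal-length bit strings (length >= 2)."""
    point = rng.randrange(1, len(a))          # random index into the string
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(s, rate, rng):
    """Flip each bit independently with (normally very low) probability `rate`."""
    flip = {"0": "1", "1": "0"}
    return "".join(flip[c] if rng.random() < rate else c for c in s)


def next_generation(population, fitness, rng, mutation_rate=0.01):
    """One generation: eliminate, reproduce, then apply crossover and mutation."""
    size = len(population)
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[: max(1, size // 2)]   # assumed: low-fitness half is eliminated
    pool = [survivors[k % len(survivors)] for k in range(size)]  # constant size
    rng.shuffle(pool)
    children = []
    for k in range(0, size - 1, 2):           # recombine random pairs
        children.extend(crossover(pool[k], pool[k + 1], rng))
    if size % 2:                              # an unpaired element is carried over
        children.append(pool[-1])
    return [mutate(c, mutation_rate, rng) for c in children]
```

For example, `next_generation(pop, lambda s: s.count("1"), random.Random(0))` would advance a population of bit strings one step under a fitness function that counts ones.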

A  more  complex  type  of  GA  is  the  classifier  system  [33].  Developed  by  Holland, 
it  consists  of  a  population  of  production  rules,  which  are  encoded  as  strings.  The 
rules  can  be  executed  to  implement  an  action  function  that  maps  external  inputs 
to  external  actions.  When  the  rules  chain  forward  to  cause  an  external  action, 
a  reinforcement  value  is  received  from  the  world.  Holland  developed  a  method, 
called  the  Bucket  Brigade,  for  propagating  reinforcement  back  along  the  chain  of 
production  rules  that  caused  the  action.  This  method  is  an  instance  of  the  class  of 
temporal  difference  methods,  which  will  be  discussed  further  in  Chapter  9.  As  a  set 
of  rules  is  run,  each  rule  comes  to  have  a  relatively  stable  value  which  is  used  as  its 
fitness.  The  standard  genetic  operations  of  reproduction,  crossover,  mutation,  etc., 
are  used  to  generate  new  populations  of  rules  from  old  ones. 

Although classifier systems are reinforcement-learners, they are not well-suited
for use in embedded systems. As with most production systems, there is no bound
on the number of rule-firings that will be required to generate an output in response
to an input, preventing the algorithm’s operation from being real-time.

Grefenstette [30] has applied GA methods directly to the time-constrained problem
of learning action strategies from reinforcement. The elements of the population
of his system are symbolic representations of action maps. The fitness of an element
is determined by executing it in the world for a number of ticks and measuring the
average reinforcement. Action maps that perform well are reproduced and recombined
to generate new action maps.

The GA approach works well on problems that can be effectively coded as syntactic
objects in which the interpretation of individual elements is relatively context-independent
and for which there are useful recombination operators. It is not yet
clear what classes of problems can be so specified. An interesting extension of the
research carried out in this dissertation would be to implement genetic algorithms
for the problems considered and compare their performance with that of the algorithms
tested herein.


3.6  Extensions  to  the  Model 


The algorithms of the previous sections have been presented in their simplest possible
forms, with only Boolean reinforcement as input and with two possible actions.
It  is  a  relatively  simple  matter  to  extend  all  of  the  algorithms  except  RC,  LARC,  and 
BP  to  the  case  of  multiple  actions.  Because  the  details  differ  for  each  one,  however, 
they  shall  be  omitted  from  this  discussion.  The  algorithms  that  choose  an  action  by 
comparing  an  internal  value  plus  noise  to  a  threshold  are  more  difficult  to  generalize 
in  this  way. 

The rest of this section will briefly detail extensions of these algorithms to work
in domains with non-Boolean and nonstationary reinforcement.

3.6.1 Non-Boolean reinforcement

Algorithms BANDIT and TSETLIN have no obvious extensions to the case of non-Boolean
reinforcement.

The learning-automata community considers three models of reinforcement: P,
Q, and S. The P-model of reinforcement is the Boolean-reinforcement model we
have already explored. In the Q-model, reinforcement is one of a finite number
of possible values that are known ahead of time. These reinforcement values can
always be scaled into values in the interval [0, 1]. Finally, the S-model allows real-valued
reinforcement in the interval [0, 1]. The notions of expediency and optimality
can be extended to apply to the Q- and S-models.

Algorithms designed for P-model environments, such as the $L_{RP}$ and $L_{RI}$ algorithms,
can be adjusted to work in Q- and S-models as follows. Let $\Delta_{i,0}$ be the
change made to action-probability i when reinforcement value 0 is received and let $\Delta_{i,1}$
be the change made when reinforcement value 1 is received. We can define, for the
new models, $\Delta_{i,r}$, the change made when reinforcement value r is received, as

$$\Delta_{i,r} = r\,\Delta_{i,1} + (1 - r)\,\Delta_{i,0},$$

a simple linear combination of the updates for the old reinforcement cases [53].
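
The linear combination above is easy to state in code. The sketch below assumes that the two P-model updates are already available as lists of per-action changes; the function name `blended_update` and this list representation are illustrative assumptions, not part of the original algorithms.

```python
def blended_update(delta_0, delta_1, r):
    """Change to each action probability for reinforcement r in [0, 1].

    delta_0[i] is the change the P-model algorithm would make to action
    probability i on reinforcement 0, and delta_1[i] the change on
    reinforcement 1.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("reinforcement must be scaled into [0, 1]")
    return [r * d1 + (1.0 - r) * d0 for d0, d1 in zip(delta_0, delta_1)]
```

With r = 0 or r = 1 the function reduces to the original P-model updates, and if each of the two update vectors sums to zero then so does the blend, so the action probabilities remain normalized.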

Algorithm TS was designed to work in an S-model of reinforcement and can be
used in such environments without change. Algorithm RC, as well as the associative
reinforcement-comparison algorithms LARC and BP, work in the more general case
of real-valued reinforcement that is not necessarily scaled to fall in the interval [0, 1].

3.6.2  Nonstationary  environments 

A world is nonstationary if er(i, a) (the expected reinforcement of performing action
a in input situation i) varies over time. It is very difficult to prove formal results
about the performance of learning algorithms in nonstationary environments, but
several observations can be made about which algorithms are likely to perform
better in such environments. For instance, algorithms with absorbing states, such
as BANDIT and $L_{RI}$, are inappropriate for nonstationary environments: if the world
changes after the algorithm has converged, it will never sample the other actions and
adjust its behavior to the changed environment. On the other hand, algorithms that
are less effective in stationary environments, such as TSETLIN and $L_{RP}$, continue to
sample all of the actions and will adapt to changes in the environment.

3.7  Conclusions 

A  number  of  effective  reinforcement-learning  algorithms  have  been  developed  by 
different  research  communities.  The  work  in  this  dissertation  seeks  to  extend  and 
improve  upon  the  previous  work  by  developing  more  effective  learning  methods 
and  by  finding  approaches  to  associative  reinforcement  learning  that  are  capable  of 
learning  a  broader  class  of  functions  than  the  linear  approaches  can,  but  doing  so 
more  space-efficiently  than  the  copy  method  and  with  fewer  input  instances  than 
are  required  by  the  error  backpropagation  method.  In  addition,  this  dissertation 
will  extend  previous  work  on  the  problem  of  learning  from  delayed  reinforcement. 


Chapter  4 


Interval  Estimation  Method 


The  interval  estimation  method  is  a  simple  statistical  algorithm  for  reinforcement 
learning.  It  is  a  logical  extension  of  the  statistical  algorithms  presented  in  the 
previous chapter. By allowing the state of the algorithm to encode not only estimates
of the relative merits of the various actions, but also the degree of confidence
that we have in those estimates, the interval estimation method builds on previous
approaches by making it easier to control the tradeoff between acting to gain information
and acting to gain reinforcement in a careful way. The interval estimation
algorithm  performs  well  on  a  variety  of  tasks  and  its  basis  in  standard  statistical 
methods  makes  it  an  illustrative  example  for  formal  analysis. 

This  chapter  presents  the  algorithm,  together  with  an  estimate  of  its  expected 
error  and  experimental  comparisons  with  many  of  the  algorithms  of  Chapter  3. 
Next, it explores ways of extending the basic algorithm to deal with the more general
learning models presented in Section 3.6. Finally, this chapter discusses the
computational complexity of the interval-estimation algorithm and argues that it,
along with other existing reinforcement-learning algorithms to which the linear-association
or backpropagation methods cannot be directly applied, is too computationally
expensive for use in embedded systems.




4.1  Description  of  the  Algorithm 

The interval estimation method can be applied in a wide variety of environments;
the simplest form will be presented first, and extensions to the basic algorithm will
be described in Section 4.5. The basic interval estimation algorithm is formally
described in Figure 21. The state consists of simple statistics: for each action
a, $n_a$ and $x_a$ are the number of times that the action has been executed and the
number of those times that have resulted in reinforcement value 1, respectively. The
evaluation function uses these statistics to compute, for each action, a confidence
interval¹ on the underlying probability, $p_a$, of receiving reinforcement value 1 given
that action a is executed. If n is the number of trials and x the number of successes
arising from a series of Bernoulli trials² with probability p, the upper bound of a
100(1 − α) percent confidence interval for p can be approximated by ub(x, n).³ The
evaluation function generates the action with the highest upper bound on expected
reinforcement.

Initially, each of the actions will have an upper bound of 1, and action 0 will
be chosen arbitrarily. As more trials take place, the bounds will tighten. The interval
estimation method balances acting to gain information with acting to gain
reinforcement by taking advantage of the fact that there are two reasons that the
upper bound for an action might be high: because there is little information about
that action, causing the confidence interval to be large, or because there is information
that the action is good, causing the whole confidence interval to be high. The
parameter $z_{\alpha/2}$ is the value that will be exceeded by the value of a standard normal
variable with probability α/2.⁴ It controls the size of the confidence intervals and,
thus, the relative weights given to acting to gain information and acting to gain
reinforcement. As $z_{\alpha/2}$ increases, more instances of reinforcement value 0 are required

¹ A 100(1 − α) percent confidence interval for a quantity is a range of values that, with probability
1 − α, contains that quantity.

² Bernoulli trials are a series of statistically independent events with binary outcomes that are
generated by some fixed underlying probability.

³ This is a somewhat more complex form than usual, designed to give good results for small values
of n [36].

⁴ Tables of this relationship can be found in most probability and statistics texts [36].




Algorithm 10 (IE) The initial state, s_0, consists of the integer variables x_0, n_0,
x_1, and n_1, each initialized to 0.

u(s, a, r) = if a = 0 then begin
                 x_0 := x_0 + r
                 n_0 := n_0 + 1
             end else begin
                 x_1 := x_1 + r
                 n_1 := n_1 + 1
             end

e(s) = if ub(x_0, n_0) > ub(x_1, n_1) then
           return 0
       else
           return 1

where ub(x, n) is the approximate upper bound of the 100(1 − α) percent confidence
interval described in the text, and z_{α/2} > 0.


Figure 21: The interval-estimation (IE) algorithm.

[Sample-run listing; columns: a0s, a0t, a0b, a1s, a1t, a1b]

Figure 22: A sample run with $p_0 = .55$, $p_1 = .45$, and $z_{\alpha/2} = 1.96$. In this case, it
converges very quickly.

to  drive  down  the  upper  bound  of  the  confidence  intervals,  causing  more  weight  to 
be  placed  on  acting  to  gain  information.  By  the  DeMoivre-Laplace  theorem  [36], 
these  bounds  will  converge,  in  the  limit,  to  the  true  underlying  probability  values, 
and,  hence,  if  each  action  is  continually  attempted,  this  algorithm  will  converge  to 
a  function  that  satisfies  Opt. 
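
To make the algorithm concrete, the following Python sketch implements the update and evaluation functions of Figure 21 together with a small simulated run. The exact expression for ub is not reproduced here, so the sketch assumes a Wilson-score form of the upper bound; this is an assumption, chosen because it is a standard interval that behaves well for small n and because it gives an upper bound of 1 when no trials have been made, as the text describes. Ties are broken in favor of action 0, again following the remark that action 0 is chosen first. The names `IntervalEstimation` and `run` are illustrative.

```python
import math
import random


def ub(x, n, z=1.96):
    """Upper confidence bound on a success probability (assumed Wilson-score form)."""
    if n == 0:
        return 1.0                      # limit of the expression as n goes to 0
    p_hat = x / n
    centre = p_hat + z * z / (2 * n)
    spread = (z / math.sqrt(n)) * math.sqrt(p_hat * (1 - p_hat) + z * z / (4 * n))
    return (centre + spread) / (1 + z * z / n)


class IntervalEstimation:
    """State: successes x[a] and trials n[a] for each of the two actions."""

    def __init__(self, z=1.96):
        self.z = z
        self.x = [0, 0]
        self.n = [0, 0]

    def update(self, a, r):
        self.x[a] += r
        self.n[a] += 1

    def evaluate(self):
        b0 = ub(self.x[0], self.n[0], self.z)
        b1 = ub(self.x[1], self.n[1], self.z)
        return 0 if b0 >= b1 else 1     # assumed tie-break toward action 0


def run(p, steps, z=1.96, seed=0):
    """Simulate `steps` trials where action a succeeds with probability p[a]."""
    rng = random.Random(seed)
    learner = IntervalEstimation(z)
    for _ in range(steps):
        a = learner.evaluate()
        r = 1 if rng.random() < p[a] else 0
        learner.update(a, r)
    return learner
```

Calling `run((0.55, 0.45), 10000)` and inspecting `learner.x` and `learner.n` gives the same kind of statistics as the sample-run listings. Under the assumed form of ub, zero successes in five trials gives an upper bound of about .43.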

In order to provide intuition about the workings of this algorithm, Figures 22
and 23 show output from two sample runs in a simulated environment in which
the actions $a_0$ and $a_1$ succeed with probabilities $p_0$ and $p_1$. The listings show the
number of successes and trials of $a_0$ (the columns headed a0s and a0t), the upper
bound on the confidence interval of $p_0$ (the column headed a0b), and the same for
$a_1$ and $p_1$ (columns headed a1s, a1t, and a1b). These statistics are shown only at
interesting points during the run of the algorithm. In Figure 22, the first few trials
of $a_1$ fail, causing the estimate of $p_1$ to be quite low; it will be executed a few more
times, once the upper bound for $p_0$ is driven near .56. The run shown in Figure
23 is somewhat more characteristic. The two actions have similar probabilities of
success, so it takes a long time for one to establish dominance.


4.2  Analysis 

In order to analytically compare this algorithm with other algorithms, we would like
to know the expected error of executing this algorithm in an environment specified
by the action-success probabilities $p_0$ and $p_1$. This section informally derives an
approximate expression for the expected error in terms of $p_0$, $p_1$, and $z_{\alpha/2}$.

Figure 23: Another sample run with $p_0 = .55$, $p_1 = .45$, and $z_{\alpha/2} = 1.96$. This time,
the two actions battle for a long time, but $a_0$ is clearly winning after 10,000 trials.


Regular  Error 


For concreteness, let us assume that $p_0 > p_1$. An error occurs every time $a_1$ is
executed, and we expect it to be executed a number of times that is sufficient to
drive the upper bound of $p_1$ below the actual value of $p_0$. We can compute this
expected number of errors by setting the expected value of the upper bound on $p_1$
equal to $p_0$ and solving for $n_1$. The expected value of the upper bound on $p_1$ is
approximately⁵ the upper bound with the number of successes set to $n_1 p_1$. This
allows us to solve the equation $ub(n_1 p_1, n_1) = p_0$ for $n_1$, yielding

$$n_1 = \frac{z_{\alpha/2}^2\, p_0 (1 - p_0)}{(p_0 - p_1)^2}.$$


As $p_0$ and $p_1$ grow close, $n_1$ goes to infinity. This is as it should be: it becomes
infinitely hard to tell which of the two actions is better. We can simplify this
expression further by abstracting away from the actual values of $p_0$ and $p_1$ and
considering their difference, δ, instead. For probabilities with a fixed difference, $n_1$
is maximized by setting $p_1$ to .5 and $p_0$ to .5 + δ. Making this simplification, we can
bound $n_1$ above by

$$\frac{z_{\alpha/2}^2}{4\delta^2}.$$

This is an approximate upper bound on the expected number of errors that will be
made on a run of infinite length. The amount of error can be obtained simply by
multiplying by δ, the magnitude of the error, yielding

$$\frac{z_{\alpha/2}^2}{4\delta},$$

which is plotted as a function of δ in Figure 24.

⁵ This is only an approximation because $n_1$ occurs inside a square-root, which does not commute
with the expectation operator.

Figure 24: Expected regular error on an infinite run as a function of δ, with $z_{\alpha/2} = 1.96$.
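
The two bounds are simple enough to evaluate directly. The short Python sketch below (function names are illustrative) tabulates the bound on the number of errors and the corresponding bound on regular error for a few values of δ, using $z_{\alpha/2} = 1.96$.

```python
def max_expected_trials(delta, z=1.96):
    """Approximate upper bound z^2 / (4 delta^2) on executions of the worse action."""
    return z * z / (4.0 * delta * delta)


def regular_error_bound(delta, z=1.96):
    """Bound on regular error: number of errors times their magnitude delta."""
    return delta * max_expected_trials(delta, z)


if __name__ == "__main__":
    for delta in (0.05, 0.1, 0.2, 0.4):
        print(delta, max_expected_trials(delta), regular_error_bound(delta))
```

Halving δ quadruples the bound on the number of errors but only doubles the bound on regular error, since each error then costs half as much.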

This result is somewhat disturbing, because the amount of error on an infinitely
long run can be made arbitrarily large by making δ arbitrarily small. However, it is
possible to bound the amount of error on a finite run of length m. The maximum
expected number of errors that could be made on such a run is m/2 (when the two
probabilities are equal, we expect to perform the actions equal numbers of times).
The number of errors is monotonically decreasing in δ, so we can easily find the
largest value of δ that could cause this many errors by solving the equation

$$\frac{m}{2} = \frac{z_{\alpha/2}^2}{4\delta^2}$$

for δ, getting $\delta = z_{\alpha/2}/\sqrt{2m}$. Thus, the maximum expected regular error on a run of length
m would be

$$\frac{z_{\alpha/2}\sqrt{m}}{2\sqrt{2}},$$

obtained by multiplying the maximum number of errors, m/2, by the maximum
magnitude of the error. This maximum regular error is $O(m^{1/2})$, which means that
the interval estimation algorithm, like the BANDIT algorithm, performs within a
constant factor of optimal when the environment is as hostile as possible.

Figure 25: A sample run with $p_0 = .55$, $p_1 = .45$, and $z_{\alpha/2} = 1.96$. The first action
almost gets stuck.
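
As a worked check of this finite-run bound, the following Python sketch (function names are illustrative) computes the worst-case δ and the resulting maximum regular error for a run of length m.

```python
import math


def worst_case_delta(m, z=1.96):
    """Largest delta for which the bound z^2 / (4 delta^2) reaches m / 2 errors."""
    return z / math.sqrt(2.0 * m)


def max_regular_error(m, z=1.96):
    """Maximum expected regular error: m / 2 errors of magnitude worst_case_delta."""
    return (m / 2.0) * worst_case_delta(m, z)
```

For m = 10,000 and $z_{\alpha/2} = 1.96$, the worst-case δ is roughly .014 and the maximum regular error is roughly 69. Quadrupling the run length doubles the bound, as expected for a quantity that grows as the square root of m.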


Error  Due  to  Sticking 

The analysis of the previous section was all carried out under the assumption that
the action $a_0$ would be executed an infinite number of times during an infinite run.
Unfortunately, this is not always the case: it is possible for $a_0$ to get stuck below $a_1$
in the following way. If there is a statistically unlikely series of trials of $a_0$ that causes
the upper bound on $p_0$ to go below the actual value of $p_1$, then it is very likely that
$a_0$ will never be executed again. When this happens, we shall say that $a_0$ is stuck.
A consequence of $a_0$ being stuck is that errors will be made for the remainder of the
run. The process of sticking is illustrated by two sample runs. In Figure 25, there
is an early series of failures for $a_0$, causing $a_1$ to be dominant. However, because
the upper bound on $p_0$ was not driven below $p_1$, the upper bound on $p_1$ eventually
goes down far enough to cause more trials of $a_0$, which bring its upper bound back
up. The run shown in Figure 26 is a case of permanent sticking. After 0 successes
in 5 trials, the upper bound on the confidence interval for $p_0$ is less than $p_1$, causing
$a_1$ to be executed for the remainder of the run.

Figure 26: A sample run with $p_0 = .55$, $p_1 = .45$, and $z_{\alpha/2} = 1.96$. Here, the first
action really does get stuck below the second.

By assuming that once $a_0$ becomes stuck below $a_1$ it never becomes unstuck, we
can bound expected error due to sticking on a run in which $a_0$ would be executed
T times, if unstuck, by

$$\sum_{t=1}^{T} \Pr\bigl(ub(x_0, t) \text{ first goes below } p_1 \text{ at time } t\bigr)\,(T - t)\,(p_0 - p_1).$$

It is the sum, over all time steps t on which $a_0$ is executed, of the probability that
$a_0$ first gets stuck at time t times the number of time steps that remain, (T − t),
times the magnitude of the error, $(p_0 - p_1)$. By solving for $x_0$, we can transform the
constraint that $ub(x_0, t) < p_1$ into

$$x_0 < t p_1 - z_{\alpha/2}\sqrt{t\,p_1(1 - p_1)}.$$



Now we must compute the probability that $x_0$ first goes below some function
f(t) at time t. The sequence of values taken on by $x_0$ over time can be modeled as
a 0-1 random walk, with $x_0(t)$ the value taken on by the walk at time t. Figure 27
depicts the function f and process $x_0$. Letting $k = \lfloor f(t) \rfloor$, the probability that $x_0$
first goes below f at time t is the product of the probabilities that $x_0(t) = k$ and
that $x_0$ never goes below f before time t. The first probability is simply

$$\binom{t}{k}\, p_0^k\, (1 - p_0)^{t-k}.$$

We can approximate the probability that $x_0$ never goes below f before time t by
substituting for f the line l that goes through the point (t, k) with slope f'(t). This
line is approximately tangent to f(t). The probability that $x_0$ never goes below l
before time t can be approximated by constructing a new random walk problem as
shown in Figure 28. The origin is the point (t, k) and the coordinates run backward
in each direction. The process $x_0'$ is a 0-1 random walk with probability k/t of getting
a 1, and the line l is the same as before. The probability that a 0-1 random walk
ever hits a line through the origin is approximately p/m where p is the probability
of getting a 1 in the random walk and m is the slope of the line [38]. Thus, the
probability that $x_0'$ never hits the line is $1 - k/(t f'(t))$.

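So, our final (approximate) answer for the probability that $x_0$ first goes below
$t p_1 - z_{\alpha/2}\sqrt{t\,p_1(1 - p_1)}$ at time t (called sp(t) for sticking probability at time t) is

$$sp(t) = \left(1 - \frac{k}{t\left(p_1 - \tfrac{1}{2} z_{\alpha/2}\sqrt{p_1(1 - p_1)/t}\right)}\right) \binom{t}{k}\, p_0^k\, (1 - p_0)^{t-k},$$

where $k = \lfloor t p_1 - z_{\alpha/2}\sqrt{t\,p_1(1 - p_1)} \rfloor$.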
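
The sticking probability and the resulting bound on sticking error can be evaluated numerically. The Python sketch below follows the approximation just described; the function names are illustrative, the binomial term is computed in log space to avoid overflow on long runs, and returning zero when k is negative (the count of successes cannot fall below a negative threshold) is an assumption about how the boundary case should be handled.

```python
import math


def f(t, p1, z=1.96):
    """Threshold on the success count below which ub(x0, t) < p1."""
    return t * p1 - z * math.sqrt(t * p1 * (1 - p1))


def f_slope(t, p1, z=1.96):
    """Derivative of f with respect to t."""
    return p1 - 0.5 * z * math.sqrt(p1 * (1 - p1) / t)


def sticking_probability(t, p0, p1, z=1.96):
    """Approximate probability sp(t) that action 0 first gets stuck at time t."""
    k = math.floor(f(t, p1, z))
    if k < 0:
        return 0.0                       # assumed: the threshold cannot be crossed
    log_binom = (math.lgamma(t + 1) - math.lgamma(k + 1) - math.lgamma(t - k + 1)
                 + k * math.log(p0) + (t - k) * math.log(1 - p0))
    never_before = 1.0 - k / (t * f_slope(t, p1, z))
    return never_before * math.exp(log_binom)


def sticking_error(T, p0, p1, z=1.96):
    """Bound on expected error due to sticking when action 0 is tried T times."""
    return sum(sticking_probability(t, p0, p1, z) * (T - t) * (p0 - p1)
               for t in range(1, T + 1))
```

With $p_0 = .55$, $p_1 = .45$, and $z_{\alpha/2} = 1.96$, the first time at which sticking is possible is t = 5, where k = 0 and the sticking probability is simply the chance of five consecutive failures.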

Figure 28: New random walk in inverted coordinate system.

Total  Error 

An  approximate  upper  bound  on  the  total  expected  error  on  a  run  of  length  T  can 
finally  be  expressed  as  the  sum  of  the  regular  and  sticking  error: 

The sticking error is summed to T', the expected number of times $a_0$ will be executed,
which is $T - z_{\alpha/2}^2 / \bigl(4 (p_0 - p_1)^2\bigr)$. There has not yet been any discussion of appropriate
values for $z_{\alpha/2}$ to take on. It determines the size of the confidence interval and,
therefore, the number of trials it takes to drive an upper bound below a certain
value. Thus, regular error increases as $z_{\alpha/2}$ increases and the interval gets larger.
As $z_{\alpha/2}$ increases, the height of f(t) decreases, making it less likely that $x_0$ will go
below it. Thus, error due to sticking decreases as $z_{\alpha/2}$ increases. This tradeoff is illustrated
in Figure 29, which plots regular error and error due to sticking as functions
of $z_{\alpha/2}$.

If we had any a priori expectations about the underlying values of $p_0$ and $p_1$
(and had some idea how to usefully approximate the monstrous form for expected
error as a closed form), we could choose $z_{\alpha/2}$ to minimize expected error.

Figure 29: Expected regular error and sticking error plotted as a function of $z_{\alpha/2}$.


4.3  Empirical  Results 


The approximations of the previous section were tested by comparing predicted
results against actual results of the interval estimation algorithm in a simulated
world. The algorithm was executed for δ ranging, in increments of .05, from .05
to .6, with $p_0$ and $p_1$ equally spaced about .5 (for δ = .1, $p_0 = .55$ and $p_1 = .45$).
For each value of δ, 1079 runs of length 10,000 were conducted. The variable $z_{\alpha/2}$
had value 1.96 throughout. Figure 30 contains a plot, for each δ, of the mean error
of the runs that did not stick, together with the predicted error. The predictions
seem to be fairly accurate for regular error. Figure 31 shows the mean error due to
sticking for each δ, along with the predicted values. This prediction is somewhat
less accurate. Nonetheless, these results are encouraging, because we can see that,
in these cases, the total expected error is quite small: less than 50 fewer instants of
reinforcement value 1 than expected from the optimal algorithm for runs of length
10,000.

Figure 30: Regular error as a function of δ; dots indicate the mean regular error on
1079 runs of length 10,000; the curve is predicted error.

Figure 31: Error due to sticking as a function of δ; dots indicate the mean error
due to sticking on 1079 runs of length 10,000; the curve is predicted error.

Task    $p_0$    $p_1$
1       .9       .1
2       .6       .4
3       .9       .8
4       .2       .1

Table 1: Parameters of test environments.

4.4  Experimental  Comparisons 

This section reports the results of a set of experiments designed to compare the performance
of the interval estimation algorithm with a number of the most promising
reinforcement-learning algorithms.


4.4.1  Algorithms  and  Environments 

The  following  algorithms  were  compared  in  these  experiments: 

•  BANDIT  (described  in  Figure  12) 

•  Lrp  (described  in  Figure  14) 

•  Lri  (described  in  Figure  14) 

•  TS  (described  in  Figure  15) 

•  RC  (described  in  Figure  16) 

•  IE  (described  in  Figure  21) 

Each of the algorithms was tested in four different environments. The environments
generate Boolean reinforcement, with positive reinforcement resulting with
probability $p_0$ after doing action $a_0$ and with probability $p_1$ after doing action $a_1$.
Table 1 shows the values of $p_0$ and $p_1$ for each environment.
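
The four test environments can be simulated in a few lines. The sketch below is only an illustration: the function names, the seeding scheme, and the fixed-policy helper are assumptions introduced here, while the probabilities are those of Table 1.

```python
import random

# Success probabilities (p0, p1) for the four tasks, copied from Table 1.
TASKS = {1: (0.9, 0.1), 2: (0.6, 0.4), 3: (0.9, 0.8), 4: (0.2, 0.1)}


def make_environment(task, seed=None):
    """Return a function mapping an action (0 or 1) to Boolean reinforcement."""
    rng = random.Random(seed)
    probs = TASKS[task]

    def reinforce(action):
        # Positive reinforcement occurs with the probability of the chosen action.
        return 1 if rng.random() < probs[action] else 0

    return reinforce


def average_reinforcement(policy, task, ticks=1000, seed=0):
    """Run a policy (a function of the tick number) and average the reinforcement."""
    env = make_environment(task, seed)
    return sum(env(policy(t)) for t in range(ticks)) / ticks
```

Always choosing action 0 in Task 1, for example, should yield an average reinforcement near .9, the "optimal" figure for that task.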

ALG-TASK               1      2      3      4

BANDIT ($k$)           1      12     10     10
Lrp ($\alpha$)         .60    .60    .30    .40
Lri ($\alpha$)         .55    .1     .05    .15
TS ($\lambda$)         .30    .20    .20    .35
RC ($\alpha$)          .40    [illegible]   .15    .50
IE ($z_{\alpha/2}$)    3.0    2.0    3.0    2.0

Table 2: Best parameter value for each algorithm in each environment.


4.4.2  Parameter  Tuning 


Each of the algorithms has a single parameter that can be chosen to make the
algorithm more or less conservative;⁶ the best choice of value for these parameters
typically depends on the length of the run, because it is more important to ensure
that an absorbing algorithm converges to the correct action on a long run. For
each algorithm and environment, a series of 100 trials of length 1000 was run with
different values of the parameter. Table 2 shows the best parameter value found for
each algorithm and environment pair.

Although these experiments are illuminating, in actual applications we will
typically want to apply these algorithms to situations in which the underlying
probabilities are not known or there is not enough time to make many runs with different
parameter values. In such situations, an algorithm that performs well over a wide
range of problems with the same parameter value is to be preferred over one that
performs well when the parameter is chosen exactly appropriately for the problem,
but poorly otherwise. As we can see in Table 2, the interval estimation algorithm
operates at its best in all of these problems with a $z_{\alpha/2}$ value between 2 and 3;
this roughly corresponds to using 95 or 99 percent confidence intervals, values that,
interestingly, are often used by human decision-makers.


6 Actually, RC also has parameters $\beta$ and a, but following the author [70], these parameters were
held constant at .1 and .3, respectively.

ALG-TASK    1        2        3        4

BANDIT      .8982    .5856    .8892    .1888
Lrp         .8172    .5190    .8665    .1521
Lri         .8911    .5872    .8780    .1934
TS          .8979    .5893    .8941    .1870
RC          .8988    .5890    .8897    .1930
IE          .9004    .5953    .8937    .1972
random      .5000    .5000    .8500    .1500
optimal     .9000    .6000    .9000    .2000

Table 3: Average reinforcement over 100 runs of length 1000.


4.4.3  Results 

After choosing the best parameter value for each algorithm and environment, the
performance of the algorithms was compared on runs of length 1000. The performance
metric was average reinforcement per tick, averaged over the entire run. The
results are shown in Table 3. These results do not tell the entire story, however.
It is important to test for statistical significance to be relatively sure that the
ordering of one algorithm over another did not arise by chance. Figure 32 shows, for
each task, a pictorial representation of the results of a 1-sided t-test applied to each
pair of experimental results. The graphs encode a partial order of significant
dominance, with solid lines representing significance at the .95 level and dashed lines
representing significance at the .85 level. We can see that the interval-estimation
algorithm dominates in nearly every task. On Task 3 its average reinforcement
value was slightly lower than that of the TS algorithm, but this difference was not
significant. The Lrp algorithm is, as expected, uniformly sub-optimal, and the rest
of the algorithms perform about the same at quite a high level.
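
A pairwise significance test of this kind can be sketched as follows. The text does not say which variant of the t-test was used, so the unequal-variance (Welch) statistic and the normal approximation to Student's t distribution used here are assumptions, as are the function and argument names; the approximation is reasonable only for samples on the order of the 100 runs used above.

```python
import math
from statistics import NormalDist, mean, variance


def one_sided_test(xs, ys):
    """Test whether mean(xs) exceeds mean(ys).

    xs and ys are per-run average reinforcement values for two algorithms.
    Returns the test statistic and an approximate one-sided confidence level.
    The two samples must not both have zero variance.
    """
    se = math.sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))
    t = (mean(xs) - mean(ys)) / se
    # Normal approximation to Student's t; an assumption suited to large samples.
    return t, NormalDist().cdf(t)
```

Under this sketch, algorithm X would be drawn as dominating algorithm Y with a solid line when the returned level is at least .95, and with a dashed line when it is at least .85.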

Figure 32: Significant dominance partial order among algorithms for each task.

Figure 33: Learning curves for Task 1.

Another view of the relative performance of the algorithms is given by examining
their learning curves. A learning curve is a plot of expected reinforcement values
versus time, which shows the rate of performance improvement. Figures 33, 34, 35,
and 36 contain, for each task, the superimposed learning curves of each algorithm for
that task. Each point represents the average reinforcement received over a sequence
of 50 ticks, averaged over 100 runs of length 1000. For Tasks 1 and 2, the curves
are hard to differentiate; the labels on the right-hand sides of the graphs indicate
the average relative performance of the algorithms on the first sample of 50 ticks.
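
The points of such a learning curve can be computed as in the following sketch, which assumes that each run is stored as a list of per-tick reinforcement values; the function name and data layout are illustrative choices, not taken from the original experiments.

```python
def learning_curve(runs, block=50):
    """Average reinforcement per block of ticks, averaged over all runs.

    runs is a list of equal-length reinforcement sequences, one per run.
    """
    length = len(runs[0])
    curve = []
    for start in range(0, length, block):
        # Pool the same block of ticks from every run before averaging.
        total = sum(sum(run[start:start + block]) for run in runs)
        count = sum(len(run[start:start + block]) for run in runs)
        curve.append(total / count)
    return curve
```

With 100 runs of length 1000 and a block size of 50, this produces the 20 points plotted for each algorithm.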


4.5  Extensions 

As  with  the  algorithms  of  Chapter  3,  the  interval  estimation  algorithm  can  be 
extended  to  work  in  more  complex  environments.  All  of  the  extensions  described 
in  this  section  have  been  implemented  and  tested  in  simulated  environments. 


4.5.1  Multiple  Inputs  and  Actions 

The interval estimation algorithm is directly generalizable to multiple actions.
Statistics are collected for each action and are used to construct upper bounds. The action
with the highest upper bound is chosen to be executed at each tick.
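
Action selection over several actions can be sketched as below. The upper-bound function is a stand-in based on the simple normal approximation to the binomial; it is an assumption made for illustration and is not necessarily the bound defined in Section 4.1. Treating an untried action as having bound 1 is likewise an assumption.

```python
import math


def upper_bound(successes, trials, z=1.96):
    """Stand-in upper confidence bound on a success probability.

    Assumption: a simple normal-approximation interval, used only to make
    the selection rule concrete.
    """
    if trials == 0:
        return 1.0  # an untried action is treated as maximally promising
    p = successes / trials
    return min(1.0, p + z * math.sqrt(p * (1.0 - p) / trials))


def choose_action(stats, z=1.96):
    """Pick the action with the highest upper bound.

    stats is a list of (successes, trials) pairs, one per action.
    """
    bounds = [upper_bound(s, n, z) for s, n in stats]
    return max(range(len(stats)), key=lambda a: bounds[a])
```

An action with few trials has a wide interval and so may be chosen even when its observed success rate is lower, which is how the algorithm trades off gaining information against gaining reinforcement.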

There is no specific way to tailor the interval estimation algorithm to work in
situations where there are multiple input states. The method of making a copy of
the internal state for each possible input situation can be applied to the interval
estimation algorithm, but because there is more than a single number associated
with each input state, it would be difficult to apply the linear association or error
backpropagation methods.

Figure 34: Learning curves for Task 2.

Figure 35: Learning curves for Task 3.

Figure 36: Learning curves for Task 4.

4.5.2  Real-valued  Reinforcement 

Rather than thinking of choosing the action with the highest probability of
succeeding, we can think of choosing the action with the highest expected reinforcement.
Under this view, the interval estimation process can be applied to the expected
value of reinforcement given that the action a is executed in situation i. If the
reinforcement for each tick is binomially distributed with parameter p, this is exactly
what is taking place in the version of the algorithm presented in Section 4.1.

Simple  extensions  can  be  made  if  a  different  probabilistic  distribution  underlies 
the  reinforcement  associated  with  taking  action.  In  order  to  handle  real-valued 
reinforcement,  for  example,  we  can  apply  the  following  two  methods:  assume  the 
normal  distribution  or  use  non-parametric  statistics. 

If the reinforcement values are normally distributed, we can use standard
statistical methods to construct a confidence interval for the expected value. In order to
do this, we must keep the following statistics: $n$, the number of trials, $\sum x$, the sum
of the reinforcement received so far, and $\sum x^2$, the sum of squares of the individual
reinforcement values. The upper bound of a $100(1 - \alpha)\%$ confidence interval for
the mean of the distribution can be computed by

    $nub(n, \sum x, \sum x^2) = \bar{y} + t^{(n-1)}_{\alpha/2} \frac{s}{\sqrt{n}}$ ,

where $\bar{y} = \sum x / n$ is the sample mean,

    $s = \sqrt{\frac{n \sum x^2 - (\sum x)^2}{n(n-1)}}$

is the sample standard deviation, and $t^{(n-1)}_{\alpha/2}$ is Student’s t function with $n - 1$ degrees
of freedom [69]. Other than using a different statistical method to compute the
upper bound of the expected reinforcement, the algorithm remains the same.
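
The upper bound for normally distributed reinforcement can be computed from the three running statistics alone. The sketch below uses the standard textbook form (sample mean plus the t quantile times the standard error); because the Python standard library has no Student's t quantile function, the quantile is passed in by the caller, which is a design choice of this sketch rather than of the original algorithm.

```python
import math


def nub(n, sum_x, sum_x2, t_value):
    """Upper bound of a confidence interval for the mean of normal data.

    n, sum_x, and sum_x2 are the number of trials, the sum of the
    reinforcement values, and the sum of their squares. t_value is
    Student's t quantile for n - 1 degrees of freedom, supplied by the
    caller (for example, from a printed table).
    """
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum_x / n
    # Sample variance from the running sums; clamp tiny negative rounding error.
    var = max(0.0, (n * sum_x2 - sum_x ** 2) / (n * (n - 1)))
    return mean + t_value * math.sqrt(var) / math.sqrt(n)
```

For the five values 1, 2, 3, 4, 5 and the tabulated 97.5th-percentile quantile 2.776 for 4 degrees of freedom, `nub(5, 15, 55, 2.776)` gives roughly 4.96.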

Even when the reinforcement values cannot be assumed to be normally
distributed, the interval estimation algorithm can be implemented using simple
non-parametric statistics.⁷ In this case, it is not possible to derive an upper bound
on expected value from summary statistics, so we must keep the individual
reinforcement values. Obviously, it is impossible to store them all, so only the data in
a sliding window are kept. The non-parametric version of the interval estimation
algorithm requires another parameter, $w$, that determines the size of the window
of data. The data are kept sorted by value as well as by time received. The upper
bound of a $100(1 - \alpha)\%$ confidence interval for the center of the underlying
distribution (whatever it may be) can be calculated, using the ordinary sign test [26], to
be the $(n - u)$th element of the sorted data, if they are labelled, starting at 1, from
smallest to largest, where $n$ is the minimum of $w$ and the number of instances received.
The value $u$ is chosen to be the largest value such that

    $\sum_{k=0}^{u} \binom{n}{k} (.5)^n \le \alpha/2$ .

For large values of $n$, $u$ can be approximated using the normal distribution.
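
The non-parametric bound can be sketched as follows, assuming the usual sign-test criterion: u is the largest integer for which the lower binomial tail with success probability one half does not exceed α/2. The function names, and the fallback of returning the largest value when the window is too small for the requested confidence, are assumptions of this sketch.

```python
from math import comb


def sign_test_u(n, alpha=0.05):
    """Largest u with P(Binomial(n, 1/2) <= u) <= alpha / 2, or -1 if none exists."""
    total, u = 0, -1
    for k in range(n + 1):
        total += comb(n, k)
        if total / 2 ** n > alpha / 2:
            break
        u = k
    return u


def nonparametric_upper_bound(window, alpha=0.05):
    """Upper bound for the center of the distribution from a non-empty window."""
    data = sorted(window)
    n = len(data)
    u = sign_test_u(n, alpha)
    if u < 0:
        return data[-1]  # assumption: too few points, fall back to the maximum
    return data[n - u - 1]  # the (n - u)th element, counting from 1
```

With a window of ten values and α = .05, u is 1, so the bound is the ninth-smallest value in the window.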

7 Non-parametric methods tend to work poorly when there are a small number of discrete values
with  very  different  magnitudes.  Practical  results  have  been  obtained  in  such  cases  by  using  methods 
for  the  normal  distribution  with  the  modification  that  each  action  is  performed  at  least  a  certain 
fixed  number  of  times.  This  prevents  the  sample  variance  from  going  to  0  on  small  samples  with 
identical  values. 
4.5.3  Non-stationary  Environments


The  basic  version  of  the  interval  estimation  algorithm  can  converge  to  absorbing 
states and, as noted in Section 3.6.2, that makes it inappropriate for use in
non-stationary environments. One way to modify the algorithm in order to fix this
problem  is  to  decay  all  of  the  statistics  associated  with  a  particular  input  value  by 
some  value  d  less  than,  but  typically  near,  1,  whenever  that  input  value  is  received. 
This  decaying  will  have  the  effect  that  the  recorded  number  of  trials  of  an  action 
that  is  not  being  executed  decreases  over  time,  causing  the  confidence  interval  to 
grow,  the  upper  bound  to  increase,  and  the  neglected  action  to  be  executed  again. 
If  its  underlying  expected  value  has  increased,  that  will  be  revealed  when  the  action 
is  executed  and  it  may  come  to  be  the  dominant  action. 
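
The decay step can be sketched as below for Boolean reinforcement, with the statistics for one input value stored as (successes, trials) pairs. The data layout, the function name, and the choice to decay immediately before recording the new trial are assumptions made for illustration.

```python
def decay_and_update(stats, action, reward, d=0.99):
    """Decay every action's (successes, trials) pair, then record the new trial.

    stats is the list of pairs kept for the input value that was just
    received; d is the decay factor, less than but typically near 1.
    """
    decayed = [(s * d, n * d) for s, n in stats]
    s, n = decayed[action]
    decayed[action] = (s + reward, n + 1)
    return decayed
```

The observed success rate of an unexecuted action is unchanged by the decay, but its recorded number of trials shrinks, so any confidence interval built from these statistics widens over time.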

This  technique  may  be  similarly  applied  when  using  statistical  methods  for 
normally-distributed  reinforcement  values.  The  non-parametric  method  described 
above  is  already  partially  suited  to  non-stationary  environments  because  old  data 
only  has  a  finite  period  of  influence  (of  length  w)  on  the  choices  of  the  algorithm. 
It  can  be  made  more  responsive  to  environmental  changes  by  occasionally  dropping 
a  data  point  from  the  list  of  an  action  that  is  not  being  executed.  This  will  cause 
the  upper  bound  to  increase,  eventually  forcing  the  action  to  be  executed  again. 

Another method of changing an algorithm to work in non-stationary
environments is to choose the “wrong action” (one that would not have been chosen by the
algorithm) with probability 1/n, where n is the number of trials that have taken
place so far. As time passes, it becomes less and less likely to do an action that is
not prescribed by the current learned policy, but executing these “wrong” actions
ensures that if they have become “right” due to changes in the environment, the
algorithm will adapt. This method is more suited to situations in which
environmental changes are expected to be more likely to happen early in a run, rather than
later.
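
A minimal sketch of this perturbation follows. When there are more than two actions, the text does not say how the "wrong" action is picked, so choosing uniformly among the non-preferred actions is an assumption, as are the names used.

```python
import random


def perturbed_action(preferred, num_actions, n, rng=random):
    """With probability 1/n return a "wrong" action, otherwise the preferred one.

    n is the number of trials so far (n >= 1); num_actions must be at least 2.
    rng is any object with random() and choice(), such as random.Random(seed).
    """
    if rng.random() < 1.0 / n:
        others = [a for a in range(num_actions) if a != preferred]
        return rng.choice(others)  # assumption: uniform over the other actions
    return preferred
```

On the first trial the perturbation always fires, and after a million trials it fires about once in a million ticks.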
4.6  Applicability  of  this  Algorithm

The  interval  estimation  algorithm  is  of  theoretical  interest  because  of  its  simplicity 
and  its  direct  ties  to  standard  statistical  methods.  It  also  performs  slightly  better 
than  many  proposed  reinforcement-learning  algorithms.  However,  this  algorithm, 
as  well  as  other  reinforcement-learning  algorithms  that  require  copies  of  the  state  for 
each  possible  input,  is  fundamentally  unsuitable  for  learning  in  embedded  systems 
because  of  its  high  computational  complexity  and  lack  of  generalization. 

Except for the linear-association and error-backpropagation algorithms, all of
the other algorithms we have examined require time at least proportional to the
number of possible actions, and space proportional to the product of the number of
inputs and the number of actions. As we begin to apply these algorithms to
real-world problems, their time and space requirements will make them impractically
slow. A driving factor in the rest of this dissertation is the need for
reinforcement-learning algorithms with lower time and space complexity, ideally proportional to
the logarithms of the numbers of inputs and actions.

In  addition,  the  interval  estimation  algorithm  completely  compartmentalizes 
the  information  it  has  about  individual  input  situations.  If  it  learns  to  perform  a 
particular  action  in  one  input  situation,  that  has  no  influence  on  what  it  will  do  in 
similar  input  situations.  In  realistic  environments,  an  agent  cannot  expect  ever  to 
encounter  all  of  the  input  situations,  let  alone  have  enough  experience  with  each  one 
to  learn  the  appropriate  response.  Thus,  it  is  important  to  develop  algorithms  that 
will  generalize  across  input  situations.  Generalization  is  a  dangerous  thing,  however; 
too  much  generalization  defeats  the  learning  of  very  complex  action  functions. 

It is possible to modify the interval-estimation algorithm in order to support
some degree of generalization across input situations. Instead of simply using the
upper bound on expected value of an action a in a situation i, it is possible
to compute a kind of average based on the results of performing action a in situations
similar to i, with “nearer” situations weighted more heavily than those farther
away. This technique requires a measure on the nearness of input situations to one
another and is no longer directly grounded in statistical theory. By addressing the
generalization issue in this way, however, we increase the computation time of the
algorithm, for now it requires evaluating action a in a number of input situations.
This only adds a constant factor that depends on the number of neighbors that are
used, but it just makes a bad situation worse.

The interval-estimation method might be made both more computationally
efficient and able to generalize across situations by using associative methods (such as
linear association or backpropagation) to store each of the components of the state
for an input-action pair. The statistical foundations of such an approach would be
weak, potentially causing a number of problems.

It  is  important  to  note,  however,  that  in  order  to  find  more  efficient  algorithms, 
we  must  give  up  something.  What  we  will  be  giving  up  is  the  possibility  of  learning 
any  arbitrary  action  mapping.  In  the  worst  case,  the  only  way  to  represent  a 
mapping  is  as  a  complete  look-up  table,  which  is  what  the  multiple-input  version 
of  the  interval-estimation  algorithm  does.  There  are  many  useful  and  interesting 
functions  that  can  be  represented  much  more  efficiently,  and  the  remainder  of  this 
work  will  rest  on  the  hope  and  expectation  that  an  agent  can  learn  to  act  effectively 
in  interesting  environments  without  needing  action  maps  of  pathological  complexity. 


Chapter  5 

Divide  and  Conquer 


Because  we  wish  to  reduce  the  complexity  of  learning  algorithms  to  be  proportional 
to  the  logarithms  of  the  numbers  of  inputs  and  outputs,  it  is  useful  to  think  of  the 
inputs  and  outputs  as  being  encoded  in  some  binary  code.  The  problem,  then,  is 
one  of  constructing  a  function  that  maps  a  number  of  input  bits  to  a  number  of 
output  bits.  If  we  can  construct  algorithms  that  effectively  learn  interesting  classes 
of  functions  with  time  and  space  complexity  that  is  polynomial  in  the  number  of 
input  and  output  bits,  we  will  have  improved  upon  the  previous  group  of  algorithms. 

Having  decided  to  view  the  problem  as  one  of  learning  a  mapping  from  many 
input  bits  to  many  output  bits,  we  can  reduce  this  problem  to  the  problem  of 
learning  a  mapping  from  many  input  bits  to  one  output  bit.  This  chapter  discusses 
such  a  problem  reduction,  first  describing  it  informally,  then  proving  its  correctness. 
It  concludes  with  an  application  of  the  reduction  method  to  a  complex  learning 
problem. 


5.1  Boolean-Function  Learners 

A Boolean-function learner (BFL) is a reinforcement-learning behavior that learns
a mapping from many input bits to one output bit. It has the same input-output
structure as any of the algorithms discussed so far, but is limited to having only
two actions. We can describe a BFL with $k$ input bits in the general form of a
learning behavior where $s_{0,k}$ is the initial state, $u_k$ is the update function and $e_k$ is
the evaluation function.

Figure 37: A cascaded learner constructed from BFL’s.

A BFL is correct if and only if whenever it chooses an action $a$ in situation
$i$, $er(i, a) \ge er(i, \neg a)$. That is, it always chooses the action that has the higher
expected reinforcement.


5.2  Cascade  Algorithm 

We can construct an algorithm that learns an action map with $N$ output bits by
using $N$ copies of a Boolean-function learning algorithm, one dedicated to learning
the function corresponding to each individual output bit. If we do this in the
simplest way, it will not work correctly: when the collection of BFL’s generates
an output pattern that does not result in positive reinforcement, it is difficult to
know whose fault it was. Perhaps only one of the bits was “wrong.” To avoid
this problem, often referred to as the “structural credit assignment” problem, we
construct a learning algorithm (as shown in Figure 37) from $N$ cascaded BFL’s. The
BFL dedicated to learning to generate the first output bit (referred to as BFL$_0$) has
the $M$ real input bits as input. The next one, BFL$_1$, has the $M$ real inputs as well
as the output of BFL$_0$ as input. In general, BFL$_k$ will have $M + k$ bits of input,
corresponding to the real inputs and the outputs of the $k$ lower-numbered BFL’s.
Each one learns what its output bit should be, given the input situation and the
values of the output bits of the lower-numbered BFL’s.

Algorithm 11 (CASCADE)

s_0 = array of s_{0,M+j} where j goes from 0 to N - 1
u(s, i, a, r) = for j := 0 to N - 1
                    u_{M+j}(s[j], concat(i, a[0..j - 1]), a[j], r)
e(s, i) = for j := 0 to N - 1
              a[j] := e_{M+j}(s[j], concat(i, a[0..j - 1]))
          return a

Figure 38: The CASCADE algorithm.


The cascade algorithm can be described as a learning behavior as shown in
Figure 38. The complexity of this algorithm can be expressed as a function of the
complexity of the component BFL’s, letting $S(s_{0,k})$ be the size of the initial state
of a BFL with $k$ inputs, $T(u_k)$ be the time for the BFL update function on $k$ input
bits, and $T(e_k)$ be the time for the BFL evaluation function with $k$ input bits. For
the entire cascade algorithm with $M$ input bits and $N$ output bits, the size of the
state is

    $O\left(\sum_{j=0}^{N-1} S(s_{0,M+j})\right)$ ,

which reduces to

    $O(N \, S(s_{0,M+N}))$ ;

the time for an update is

    $O(N \, T(u_{M+N}))$ ;

and the time for an evaluation is

    $O(N \, T(e_{M+N}))$ .

Given efficient algorithms for implementing the BFL’s, the cascade method can
construct an efficient algorithm for learning functions with any number of output
bits.¹
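
The cascade construction can be sketched in Python as follows. The BFL interface (an `evaluate` and an `update` method that mutates the learner's own state) and the `ConstantBFL` stand-in are assumptions introduced only to make the wiring concrete; any Boolean-function learner with that interface could be substituted.

```python
class Cascade:
    """N cascaded Boolean-function learners (BFLs), one per output bit.

    Each BFL must provide evaluate(bits) -> 0 or 1 and update(bits, bit, r);
    that interface is an assumption made for this sketch.
    """

    def __init__(self, make_bfl, num_inputs, num_outputs):
        # BFL j sees the M real inputs plus the j earlier output bits.
        self.bfls = [make_bfl(num_inputs + j) for j in range(num_outputs)]

    def evaluate(self, inputs):
        action = []
        for bfl in self.bfls:
            action.append(bfl.evaluate(tuple(inputs) + tuple(action)))
        return action

    def update(self, inputs, action, r):
        # Every BFL receives the same global reinforcement value r.
        for j, bfl in enumerate(self.bfls):
            bfl.update(tuple(inputs) + tuple(action[:j]), action[j], r)


class ConstantBFL:
    """Trivial stand-in learner that always outputs 1 and records its updates."""

    def __init__(self, k):
        self.k, self.seen = k, []

    def evaluate(self, bits):
        assert len(bits) == self.k
        return 1

    def update(self, bits, bit, r):
        self.seen.append((bits, bit, r))
```

Both `evaluate` and `update` make one call per output bit, which matches the factor of N in the complexity expressions above.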

This  efficiency  comes  at  a  price,  however.  Even  if  there  is  no  noise  in  the 
environment,  a  mistake  made  on  bit  j  will  cause  the  reinforcement  information  for 
bits  0  through  j  —  1  to  be  in  error.  To  see  this,  consider  the  case  of  two  output  bits. 
Given  input  instance  i,  bit  0  is  generated  to  have  the  value  1;  then,  bit  1  is  generated, 
as  a  function  of  both  i  and  the  value  of  bit  0,  to  have  the  value  0.  If  the  correct 
response  in  this  case  was  {1,1},  then  each  of  the  bits  will  be  given  low  reinforcement 
values,  even  though  bit  0  was  correct.  This  brings  to  light  another  requirement  of 
the BFL’s: they must work correctly in non-stationary environments. As the
higher-numbered BFL’s are in the process of converging, the lower-numbered ones will be
getting  reinforcement  values  that  are  not  necessarily  indicative  of  how  well  they  are 
performing.  Once  the  higher-numbered  BFL’s  have  converged,  the  lower-numbered 
BFL’s  must  be  able  to  disregard  their  earlier  training  and  learn  to  act  correctly 
given  the  functions  that  the  higher-numbered  BFL’s  are  now  implementing. 


5.3  Correctness  and  Convergence 

In  order  to  show  that  this  algorithm  works,  we  must  demonstrate  two  points.  First, 
that  if  the  component  BFL’s  converge  to  correct  behavior  then  the  behavior  of 
the  entire  construction  will  be  correct.  Second,  that  the  component  BFL’s  are 
trained  in  a  way  that  guarantees  that  they  will  converge  to  correct  behavior.  These 
requirements  will  be  referred  to  as  correctness  and  convergence. 

1 This assumes that $S(s_{0,k})$, $T(u_k)$, and $T(e_k)$ are all monotonically non-decreasing in $k$.

5.3.1  Correctness

This section presents a proof that the cascade construction is correct for the case
of two output bits. A similar proof can be constructed for cases with any number
of bits. Assume that the two BFL’s have already converged, the first one to the
function $f_0$, and the second to the function $f_1$. The following formula asserts that
the function $f_0$ is correct, given the choice of $f_1$:

    $\forall i.\ er(f_0(i), f_1(i, f_0(i))) \ge er(\neg f_0(i), f_1(i, \neg f_0(i)))$ ;    (1)

that is, that for any value of the input $i$, it is better for the first bit to have the
value $f_0(i)$ than its opposite. Similarly, we can assert that the function $f_1$ is correct:

    $\forall i, b.\ er(b, f_1(i, b)) \ge er(b, \neg f_1(i, b))$ ;    (2)

that is, that for any value of input $i$ and first bit $b$ ($b$ is the output of $f_0$ in the
cascade), it is better that the second bit have the value $f_1(i, b)$ than its opposite.

We would like to show that the composite output of the cascade algorithm is
correct: that is, that for any input, no two-bit output has higher expected
reinforcement than the one that is actually chosen by $f_0$ and $f_1$. This can be stated
formally as the following conjunction:

    $\forall i.\ er(f_0(i), f_1(i, f_0(i))) \ge er(\neg f_0(i), f_1(i, f_0(i))) \ \wedge$    (3)

    $\forall i.\ er(f_0(i), f_1(i, f_0(i))) \ge er(f_0(i), \neg f_1(i, f_0(i))) \ \wedge$    (4)

    $\forall i.\ er(f_0(i), f_1(i, f_0(i))) \ge er(\neg f_0(i), \neg f_1(i, f_0(i)))$ .    (5)

The first conjunct, 3, can be shown with a proof by cases. In the first case, given
input $i$, function $f_1$ is insensitive to its second argument: that is, $f_1(i, x) = f_1(i, \neg x)$.
In this case,

    $er(\neg f_0(i), f_1(i, f_0(i))) = er(\neg f_0(i), f_1(i, \neg f_0(i)))$ ;    (6)

from 6 and assumption 1 we can conclude that

    $er(f_0(i), f_1(i, f_0(i))) \ge er(\neg f_0(i), f_1(i, f_0(i)))$ .

In the second case, function $f_1$ is sensitive to its second argument when the first
argument has value $i$; that is, $f_1(i, x) = \neg f_1(i, \neg x)$. In this case,

    $er(\neg f_0(i), f_1(i, f_0(i))) = er(\neg f_0(i), \neg f_1(i, \neg f_0(i)))$ .    (7)

Combining assumptions 1 and 2, we can derive

    $er(f_0(i), f_1(i, f_0(i))) \ge er(\neg f_0(i), \neg f_1(i, \neg f_0(i)))$ .    (8)

From 7 and 8, we have our desired conclusion, that

    $er(f_0(i), f_1(i, f_0(i))) \ge er(\neg f_0(i), f_1(i, f_0(i)))$ .

The second conjunct, 4, follows directly from assumption 2.

The third conjunct, 5, also requires a proof based on the same cases used in the
proof of the first conjunct. In the first case, $f_1(i, x) = f_1(i, \neg x)$, so

    $er(\neg f_0(i), \neg f_1(i, f_0(i))) = er(\neg f_0(i), \neg f_1(i, \neg f_0(i)))$ .    (9)

From 9 and result 8 above, we can derive

    $er(f_0(i), f_1(i, f_0(i))) \ge er(\neg f_0(i), \neg f_1(i, f_0(i)))$ .

In the second case, $f_1(i, x) = \neg f_1(i, \neg x)$, so

    $er(\neg f_0(i), \neg f_1(i, f_0(i))) = er(\neg f_0(i), f_1(i, \neg f_0(i)))$ .

Combining this result with assumption 1, we get the desired result, that

    $er(f_0(i), f_1(i, f_0(i))) \ge er(\neg f_0(i), \neg f_1(i, f_0(i)))$ .

Thus, we can see that local assumptions of correctness for each BFL are sufficient
to guarantee global correctness of the entire cascade algorithm.
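
The two-bit result can also be checked numerically for a single fixed input. In the sketch below, which is an illustration rather than part of the proof, the second-bit function is built to satisfy the local correctness assumption for the second BFL, the first bit is then chosen to satisfy the assumption for the first BFL, and the resulting output is compared against all four possible outputs for randomly drawn expected-reinforcement tables.

```python
import itertools
import random


def cascade_choice(er):
    """er maps (bit0, bit1) to expected reinforcement for one fixed input."""
    # f1 is locally correct: for each first bit b, pick the better second bit.
    f1 = {b: max((0, 1), key=lambda c: er[(b, c)]) for b in (0, 1)}
    # f0 is locally correct: pick the better first bit given f1's response.
    f0 = max((0, 1), key=lambda b: er[(b, f1[b])])
    return f0, f1[f0]


def check(trials=1000, seed=0):
    """Return True if the cascade's choice is globally best in every trial."""
    rng = random.Random(seed)
    outputs = list(itertools.product((0, 1), repeat=2))
    for _ in range(trials):
        er = {o: rng.random() for o in outputs}
        if er[cascade_choice(er)] < max(er.values()):
            return False
    return True
```

A run of `check()` that returns True is consistent with the proof, though of course it does not replace it.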

5.3.2  Convergence 

Now, we must show that the BFL’s are trained in a way that justifies assumptions 1
and 2 above. It is difficult to make this argument precise without making very strong
assumptions about the BFL’s and the environment. Informally, the argument is as
follows. The highest-numbered BFL (BFL$_N$) always gets correct reinforcement and
so converges to the correct strategy; this is because, independent of what the
lower-numbered BFL’s are doing, it can learn always to make the best of a bad situation.
Once this has happened, BFL$_{N-1}$ will get correct reinforcement; because its internal
learning algorithm works in non-stationary environments, it will converge to behave
in the best way it can in light of what BFL$_N$ does (which now is correct). This
argument can be made all the way up to BFL$_0$.

In general, the convergence process may work somewhat differently.
Convergence happens on an input-by-input basis, because there is no guarantee that the
whole input space will be explored during any finite prefix of a run of the agent.
Rather, an input comes in from the world and all the BFL’s except BFL$_N$ generate
their output bits. This constitutes a learning instance for BFL$_N$, which can gain
information about what to do in this situation. After this situation has occurred a
few times, BFL$_N$ will converge for that input situation (including the bits generated
by the lower-numbered BFL’s). As the lower-numbered BFL’s begin to change their
behavior, they may generate output patterns that BFL$_N$ has never seen, requiring
BFL$_N$ to learn what to do in that situation before the lower-numbered BFL’s can
continue their learning process.


5.4  Example 

As  a  simple  illustration  of  the  cascade  reduction  method,  this  section  outlines  its  use, 
in  conjunction  with  the  interval  estimation  algorithm,  to  solve  a  complex  learning 
problem.  As  a  baseline  for  comparison,  we  also  consider  the  use  of  the  interval 
estimation  algorithm  in  conjunction  with  the  method  of  adding  extra  copies  of  the 
basic  statistical  algorithm  to  handle  multiple  actions.  These  two  methods  will  be 
compared  in  terms  of  computational  complexity  and  performance  on  the  learning 
problem. 


5.4.1  Complexity 

If there are $M$ input bits and $N$ output bits, the space complexity of an instance of
the interval estimation algorithm with a copy of the basic algorithm for each
input-action pair is $O(2^{M+N})$. The cascade method requires $N$ copies of the algorithm,
each with 1 output bit and up to $M + N - 1$ input bits. The total space requirement
for the cascade algorithm would, in this case, be $O(N 2^{M+N})$, which is worse than
using the simple copying method.

The time complexity of an update operation (if indexing is ignored) is constant
for the copying method; the cascade method requires each component BFL to be
updated, using $O(N)$ time.

The time complexity of an evaluation using the simple copying method is $O(2^N)$,
because each possible action must be evaluated. Using the cascade method, however,
it is $O(N)$, because only 2 actions must be evaluated for each output bit.

Each cycle of a learning behavior requires one update and one evaluation: for the
copying method this requires $O(1) + O(2^N) = O(2^N)$ time; for the cascade method
it requires $O(N) + O(N) = O(N)$ time. Thus, the space complexity is somewhat
greater using the cascade method, but computation time is considerably shorter.
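
To give a feel for the magnitudes involved, the following sketch counts table entries and per-cycle operations for both methods. The counting conventions are assumptions made here: one table entry per input-action pair, one unit of time per action evaluated or learner updated, and constant factors ignored.

```python
def copying_costs(m, n):
    """Rough space and per-cycle time for the copying method."""
    # One entry per input-action pair; one update plus 2 ** n evaluations.
    return {"space": 2 ** (m + n), "time": 1 + 2 ** n}


def cascade_costs(m, n):
    """Rough space and per-cycle time for the cascade method."""
    # BFL j has m + j input bits and two actions: 2 ** (m + j + 1) entries.
    space = sum(2 ** (m + j + 1) for j in range(n))
    # n updates plus n evaluations per cycle.
    return {"space": space, "time": 2 * n}
```

For $M = 10$ and $N = 5$, the sizes of the adder problem of the next section, these counts give 32768 entries and 33 operations per cycle for the copying method, against 63488 entries and 10 operations for the cascade method. The cascade count stays below the $N 2^{M+N}$ bound given above.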


5.4.2  Performance 

A  moderately  complex  reinforcement-learning  problem  is  that  of  learning  to  be  an 
n-bit  adder:  the  learner  has  2n  input  bits,  representing  the  addends,  and  n  output 
bits,  representing  the  result.  It  is  given  reinforcement  value  1  if  the  output  bits  are 
the  binary  sum  of  the  first  n  input  bits  and  the  second  n  input  bits,  otherwise  it  is 
given  reinforcement  value  0.  For  this  experiment,  a  5-bit  adder  problem  was  used; 
it  has  fairly  high  complexity,  with  1024  possible  inputs  and  32  possible  outputs. 
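
The reinforcement rule for the adder task can be sketched as follows. This is an illustrative sketch under two assumptions that the text does not state: bits are ordered most-significant first, and the carry out of the top bit is discarded because only n output bits are available. The names `bits_to_int` and `adder_reinforcement` are hypothetical.

```python
def bits_to_int(bits):
    """Interpret a sequence of 0/1 values as an unsigned integer, MSB first."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def adder_reinforcement(inputs, outputs):
    """Return 1 if the n output bits equal the sum of the two n-bit addends, else 0."""
    n = len(outputs)
    if len(inputs) != 2 * n:
        raise ValueError("expected 2n input bits for n output bits")
    a = bits_to_int(inputs[:n])
    b = bits_to_int(inputs[n:])
    # Assumption: overflow is dropped, so the comparison is modulo 2^n.
    return 1 if bits_to_int(outputs) == (a + b) % (2 ** n) else 0
```

For n = 5 there are 2^10 = 1024 distinct input vectors and 2^5 = 32 candidate output vectors, which matches the counts given above.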

As  we  can  see  in  Figure  39,  which  shows  average  reinforcement  as  a  function 
of  time  (data  points  represent  averages  of  100  time  steps),  the  cascade  method  has 
much  better  performance  than  the  simple  copying  method.  One  reason  for  the 
superior  performance  of  the  cascade  method  over  the  copy  method  is  that,  in  the 
cascade  method,  the  output  bits  are  being  trained  in  parallel  and  the  agent  will  not, 
in general, have to try all (or even half) of the 2^N possible actions in each input
situation before finding the correct one. At first, it may seem that the algorithm
is somehow taking advantage of the structure of the adder problem, because the
general solution to the n-bit adder problem involves feeding intermediate results
(carries) to later parts of the computation. Upon closer examination, however, it
is clear that the intermediate results are simply less-significant output bits, which
are not related to the values of the carries and do not simplify the computation of
the more-significant output bits. Thus, the performance of the CASCADE algorithm
cannot be attributed to the special structure of the adder problem.

Figure 39: Performance of interval estimation algorithm on 5-bit adder problem
using copying method and cascade method of generating multiple outputs.


Chapter  6 


Learning Boolean Functions in
k-DNF


6.1  Background 

In  the  previous  chapter,  we  saw  that  the  problem  of  learning  an  action  map  with 
many  output  bits  can  be  reduced  to  the  problem  of  learning  a  collection  of  action 
maps  with  single  Boolean  outputs.  Such  action  maps  can  be  described  by  formulae 
in propositional logic, in which the atoms are input bits. The formula (i_1 ∧ i_2) ∨ ¬i_0
describes  an  action  map  that  performs  action  1  whenever  input  bits  1  and  2  are  on 
or  input  bit  0  is  off  and  performs  action  0  otherwise. 
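
As a small illustration of reading a formula as an action map, the example formula can be written directly as a function of the input bits. This is only a sketch; the names `action_map` and `truth_table` are hypothetical, and the input is assumed to be a tuple indexed by bit number.

```python
from itertools import product


def action_map(i):
    """Return action 1 when (i1 and i2) or (not i0) holds, and action 0 otherwise."""
    return 1 if (i[1] and i[2]) or not i[0] else 0


def truth_table(m=3):
    """Map every m-bit input vector to the action chosen for it."""
    return {bits: action_map(bits) for bits in product((0, 1), repeat=m)}


if __name__ == "__main__":
    for bits, action in truth_table().items():
        print(bits, "->", action)
```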

As  we  saw  in  Section  4.6,  any  learning  algorithm  that  is  to  be  more  efficient 
than  methods  like  interval  estimation  will  only  be  able  to  learn  a  restricted  class  of 
action  maps.  When  there  are  only  two  possible  actions,  we  can  describe  the  class 
of action maps that are learnable by an algorithm in terms of syntactic restrictions
on  the  corresponding  class  of  propositional  formulae.  This  method  is  widely  used 
in  the  formal  literature  on  concept  learning. 

A restriction that has proved useful to the concept-learning community is to
the class of functions that can be expressed as propositional formulae in k-DNF. A
formula is said to be in disjunctive normal form (DNF) if it is syntactically organized
into a disjunction of purely conjunctive terms; there is a simple algorithmic method
for converting any formula into DNF [21]. A formula is in the class k-DNF if and
only if its representation in DNF contains only conjunctive terms of length k or
less. There is no restriction on the number of conjunctive terms, only on their length.
Whenever k is less than the number of atoms in the domain, the class k-DNF is a
restriction on the class of functions.

The next section presents Valiant’s algorithm for learning functions in k-DNF
from input-output pairs. The following sections describe algorithms for learning
action maps in k-DNF from reinforcement and present the results of an empirical
comparison of their performance. For each reinforcement-learning algorithm,
the inputs are bit-vectors of length M, plus a distinguished reinforcement bit; the
outputs are single bits.


6.2 Learning k-DNF from Input-Output Pairs

Valiant was one of the first to consider the restriction to learning functions
expressible in k-DNF [76,77]. He developed an algorithm, shown below, for learning
functions in k-DNF from input-output pairs, which actually only uses the
input-output pairs with output 0.

Algorithm 12 (VALIANT) Let T be initialized to the set of conjunctive terms of
length k over the set of atoms (corresponding to the input bits) and their negations,
and let L be the number of learning instances required to learn the concept to the
desired accuracy.1

for i := 1 to L do begin
    v := randomly drawn negative instance
    T := T − {terms that are satisfied by v}
end
return T

1 This choice is not relevant to our reinforcement-learning scenario; the details are
described in Valiant’s papers [76,77].
Algorithm 13 (LARCKDNF) Let F_T be a function mapping an M-bit input vector
into a 2^k \binom{M}{k}-bit vector, each of whose elements is the result of evaluating
an element of T on the raw input vector.

Let s_0 of this algorithm be the initial state, s_0, of an instance of the LARC algorithm
with 2^k \binom{M}{k} bits. The update function will be u of LARC, with the input F_T(i),
and, similarly, the evaluation function will be e of LARC, with the input F_T(i).


Figure 40: The linear-association reinforcement-comparison algorithm for learning
functions in k-DNF from reinforcement.

The VALIANT algorithm returns the set of terms remaining in T, with the
interpretation that their disjunction is the concept that was learned by the algorithm.
This  method  simply  examines  a  fixed  number  of  negative  instances  and  removes  any 
term  from  T  that  would  have  caused  one  of  the  negative  instances  to  be  satisfied.2 
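
The elimination procedure can be sketched in a few lines. This is a minimal sketch under stated assumptions, not Valiant's own presentation: a term is represented as a tuple of (bit index, required value) pairs, terms that mention the same atom twice are left out of the initial set, and the caller supplies the negative instances as an iterable instead of drawing L of them at random. The names `all_terms`, `satisfies`, `valiant`, and `predict` are hypothetical.

```python
from itertools import combinations, product


def all_terms(m, k):
    """All conjunctions of exactly k literals over m atoms, each atom used at most once."""
    return {
        tuple(zip(idx, signs))
        for idx in combinations(range(m), k)
        for signs in product((1, 0), repeat=k)
    }


def satisfies(v, term):
    """True when every literal in the term agrees with the input vector v."""
    return all(v[j] == want for j, want in term)


def valiant(m, k, negative_instances):
    """Remove every term that is satisfied by some negative instance."""
    T = all_terms(m, k)
    for v in negative_instances:
        T = {t for t in T if not satisfies(v, t)}
    return T


def predict(T, v):
    """The learned concept is the disjunction of the terms remaining in T."""
    return int(any(satisfies(v, t) for t in T))
```

For example, with m = 3, k = 2, and a target concept that is the conjunction of bits 0 and 1, feeding in all six negative instances leaves only the term that requires both of those bits to be 1.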


6.3 Combining the LARC and VALIANT Algorithms

Given our interest in restricted classes of functions, we can construct a hybrid
algorithm for learning action maps in k-DNF. It hinges on the simple observation
that any such function is a linear combination of terms in the set T, where T is
the set of conjunctive terms of length k over the set of atoms (corresponding to the
input bits) and their negations. It is possible to take the original M-bit input signal
and transduce it to a wider signal that is the result of evaluating each member of T
on the original inputs. We can use this new signal as input to a linear-associative
reinforcement learning algorithm, such as Sutton’s LARC algorithm (described in
Figure 18). If there are M input bits, the set T has size \binom{2M}{k}, because we are
choosing from the set of input bits and their negations. However, we can eliminate
all elements that contain both an atom and its negation, yielding a set of size
2^k \binom{M}{k}.
The combined algorithm, called LARCKDNF, is described formally in Figure 40.

2 Valiant’s presentation of the algorithm defines T to be the set of conjunctive terms of length
k  or  less  over  the  set  of  atoms  and  their  negations;  however,  because  any  term  of  length  less  than 
k  can  be  represented  as  a  disjunction  of  terms  of  length  k,  we  use  a  smaller  set  T  for  simplicity  in 
exposition  and  slightly  more  efficient  computation  time. 
The space required by the LARCKDNF algorithm, as well as the time to update
the internal state or to evaluate an input instance, is proportional to the size of T,
and thus, O(M^k).
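
The widening of the input signal can be sketched as follows. This is an illustrative sketch only: terms are represented as tuples of (bit index, required value) pairs, contradictory terms are never generated, and the names `make_terms` and `expand` are hypothetical. The LARC learner that would consume the widened signal is not shown.

```python
from itertools import combinations, product
from math import comb


def make_terms(m, k):
    """Enumerate the conjunctive terms of length k with no atom repeated."""
    return [
        tuple(zip(idx, signs))
        for idx in combinations(range(m), k)
        for signs in product((1, 0), repeat=k)
    ]


def expand(v, terms):
    """Evaluate every term on input vector v, giving one output bit per term."""
    return [int(all(v[j] == want for j, want in t)) for t in terms]


if __name__ == "__main__":
    terms = make_terms(6, 2)
    print(len(terms), 2 ** 2 * comb(6, 2))   # both counts agree
    print(sum(expand((1, 0, 1, 1, 0, 0), terms)))
```

For any input, exactly one sign pattern matches for each choice of k atoms, so the widened vector always contains \binom{M}{k} ones out of 2^k \binom{M}{k} bits.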


6.4 Interval Estimation Algorithm for k-DNF

The interval estimation algorithm for k-DNF is, like the algorithm described in
Section 6.3, based on Valiant’s algorithm, but the interval estimation algorithm
uses standard statistical estimation methods, like those used in the IE algorithm,
rather than weight-adjustments.

The algorithm will first be described independent of particular statistical tests,
which will be introduced later in the section. We shall need the following definitions,
however. An input bit vector satisfies a term whenever all the bits mentioned
positively in the term have value 1 in the input and all the bits mentioned negatively
in the term have value 0 in the input. The quantity er(t, a) is the expected value
of the reinforcement that the agent will gain, per trial, if it generates action a
whenever term t is satisfied by the input and action ¬a otherwise. The quantity
ubr_α(t, a) is the upper bound of a 100(1 − α)% confidence interval on the expected
reinforcement gained from performing action a whenever term t is satisfied by the
input. The formal definition of the algorithm is given in Figure 41.

At any moment in the operation of this algorithm, we can extract a symbolic
description of its current hypothesis. It is the disjunction of all terms t such that
ubr_α(t, 1) > ubr_α(t, 0) and Pr(er(t, 1) = er(t, 0)) < β. This is the k-DNF expression
according to which the agent is choosing its actions.

As  in  the  regular  interval-estimation  algorithm,  the  evaluation  criterion  is  chosen 
in  such  a  way  as  to  make  the  important  trade-off  between  acting  to  gain  information 
and  acting  to  gain  reinforcement.  Thus,  the  first  requirement  for  a  term  to  cause  a 
1  to  be  emitted  is  that  the  upper  bound  on  the  expected  reinforcement  of  emitting 
a  1  when  this  term  is  satisfied  is  higher  than  the  upper  bound  on  the  expected 
reinforcement  of  emitting  a  0  when  the  term  is  satisfied. 


Algorithm 14 (IEKDNF)

s_0 = the set T, with a collection of statistics
      associated with each member of the set

e(s, i) = for each t in s
              if i satisfies t and
                 ubr_α(t, 1) > ubr_α(t, 0) and
                 Pr(er(t, 1) = er(t, 0)) < β
              then return 1
          return 0

u(s, i, a, r) = for each t in s
                    update_term_statistics(t, i, a, r)
                return s


Figure 41: The interval estimation algorithm for learning concepts in k-DNF from
reinforcement.


Let  the  equivalence  probability  of  a  term  be  the  probability  that  the  expected 
reinforcement  is  the  same  no  matter  what  choice  of  action  is  made  when  the  term  is 
satisfied.  The  second  requirement  for  a  term  to  cause  a  1  to  be  emitted  is  that  the 
equivalence  probability  be  small.  Without  this  criterion,  terms  for  which  no  action 
is  better  will,  roughly,  alternate  between  choosing  action  1  and  action  0.  Because 
the  output  of  the  entire  algorithm  will  be  1  whenever  any  term  has  value  1,  this 
alternation  of  values  can  cause  a  large  number  of  wrong  answers.  Thus,  if  we  can 
convince  ourselves  that  a  term  is  irrelevant  by  showing  that  its  choice  of  action 
makes  no  difference,  we  can  safely  ignore  it. 

In the simple Boolean reinforcement-learning scenario, the necessary statistical
tests are quite simple. For each term, the following statistics are stored: n_0, the
number of trials of action 0; s_0, the number of successes of action 0; n_1, the number of
trials of action 1; and s_1, the number of successes of action 1. These are incremented
only when the associated term is satisfied by the current input instance. Using the
definition of ub(x, n) from Figure 21, we can define ubr_α(t, 0) as ub(s_0, n_0) and
ubr_α(t, 1) as ub(s_1, n_1), where s_0, n_0, s_1, and n_1 are the statistics associated with
term t and α is used in the computation of ub.
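
The bookkeeping for a single term can be sketched as follows. This is a minimal sketch, assuming Boolean reinforcement (r is 0 or 1) and a term represented as a tuple of (bit index, required value) pairs; the class name `TermStats` and the signature of `update_term_statistics` are hypothetical and are not taken from the figures.

```python
from dataclasses import dataclass


@dataclass
class TermStats:
    n0: int = 0  # trials of action 0 while the term was satisfied
    s0: int = 0  # successes of action 0
    n1: int = 0  # trials of action 1 while the term was satisfied
    s1: int = 0  # successes of action 1


def satisfies(i, term):
    """True when every (bit index, required value) literal matches input i."""
    return all(i[j] == want for j, want in term)


def update_term_statistics(stats, term, i, a, r):
    """Increment the counters for action a, but only if input i satisfies the term."""
    if not satisfies(i, term):
        return stats
    if a == 0:
        stats.n0 += 1
        stats.s0 += r
    else:
        stats.n1 += 1
        stats.s1 += r
    return stats
```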

To test for equality of the underlying Bernoulli parameters, we use a two-sided
test at the β level of significance that rejects the hypothesis that the parameters
are equal whenever

\[
\frac{\dfrac{s_0}{n_0} - \dfrac{s_1}{n_1}}
     {\sqrt{\left(\dfrac{s_0 + s_1}{n_0 + n_1}\right)
            \left(1 - \dfrac{s_0 + s_1}{n_0 + n_1}\right)
            \left(\dfrac{n_0 + n_1}{n_0 n_1}\right)}}
\]

is either less than −z_{β/2} or greater than +z_{β/2}, where z_{β/2} is a standard
normal deviate [36]. Because sample size is important for
this test, the algorithm is slightly modified to ensure that, at the beginning of a run,
each action is chosen a minimum number of times. This parameter will be referred
to as β_min.
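
The test statistic is straightforward to compute from the four counters. The following is a hedged sketch: the function names `equality_z` and `rejects_equality` are hypothetical, the critical value is passed in directly as a number, and returning 0 when the pooled variance is zero is a convention chosen here, not something specified in the text.

```python
from math import sqrt


def equality_z(s0, n0, s1, n1):
    """Pooled two-sample statistic for the difference of two success rates."""
    if n0 <= 0 or n1 <= 0:
        raise ValueError("each action must have been tried at least once")
    pooled = (s0 + s1) / (n0 + n1)
    variance = pooled * (1 - pooled) * (n0 + n1) / (n0 * n1)
    if variance == 0:
        # Assumption: all trials succeeded or all failed, so the rates are equal.
        return 0.0
    return (s0 / n0 - s1 / n1) / sqrt(variance)


def rejects_equality(s0, n0, s1, n1, z_crit):
    """Two-sided test: reject equality when the statistic lies beyond +/- z_crit."""
    return abs(equality_z(s0, n0, s1, n1)) > z_crit
```

With 9 successes in 10 trials of one action and 1 success in 10 trials of the other, the pooled rate is 0.5 and the statistic has magnitude 0.8 / sqrt(0.05), roughly 3.58, so equality would be rejected at a critical value of 2.5.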

As for the interval-estimation algorithm, real-valued reinforcement can be
handled in IEKDNF using statistical tests appropriate for normally-distributed values or
for  non-parametric  models.  In  nonstationary  environments,  statistics  can  be  scaled 
in  order  to  ensure  that  the  algorithm  does  not  stay  converged  to  a  non-optimal 
strategy. 

The order complexity of this algorithm is the same as that of the LARCKDNF
algorithm of Section 6.3, namely O(M^k).


6.5  Empirical  Comparison 


This  section  reports  the  results  of  a  set  of  experiments  designed  to  compare  the 
performance  of  the  algorithms  discussed  in  this  chapter  with  one  another,  as  well 
as  with  some  other  standard  methods. 
6.5.1 Algorithms and Environments

The  following  algorithms  were  tested  in  these  experiments: 

•  LARC  (Defined  in  Figure  18) 

•  LARC+  (LARC  with  an  extra  input  wired  to  have  a  constant  value) 

•  LARCKDNF  (Defined  in  Figure  40) 

•  IEKDNF  (Defined  in  Figure  41) 

•  BP  (Defined  in  Figures  19  and  20) 

•  IE  (Defined  in  Figure  21) 

The regular interval-estimation algorithm IE is included as a yardstick; it is
computationally much more complex than the other algorithms and should be expected
to out-perform them.

Each of the algorithms was tested in three different environments. The
environments are called binomial Boolean expression worlds and can be characterized by
the parameters M, expr, p_1s, p_1n, p_0s, and p_0n. The parameter M is the number of
input bits; expr is a Boolean expression over the input bits; p_1s is the probability of
receiving reinforcement value 1 given that action 1 is taken when the input instance
satisfies expr; p_1n is the probability of receiving reinforcement value 1 given that
action 1 is taken when the input instance does not satisfy expr; p_0s is the
probability of receiving reinforcement value 1 given that action 0 is taken when the input
instance satisfies expr; and p_0n is the probability of receiving reinforcement value 1 given
that action 0 is taken when the input instance does not satisfy expr. Input vectors
are chosen by the world according to a uniform probability distribution.
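
Such a world is easy to simulate. The following is a minimal sketch, not the simulator used for the experiments; the class name `BooleanExpressionWorld` and the helper names are hypothetical, and `expr` is assumed to be a Python callable over a tuple of bits.

```python
import random
from itertools import product


class BooleanExpressionWorld:
    """Binomial Boolean expression world with parameters M, expr, p1s, p1n, p0s, p0n."""

    def __init__(self, m, expr, p1s, p1n, p0s, p0n, seed=None):
        self.m = m
        self.expr = expr
        # p[action][expr satisfied?] -> probability of reinforcement value 1
        self.p = {1: {True: p1s, False: p1n}, 0: {True: p0s, False: p0n}}
        self.rng = random.Random(seed)

    def draw_input(self):
        """Inputs are drawn uniformly from all M-bit vectors."""
        return tuple(self.rng.randint(0, 1) for _ in range(self.m))

    def success_probability(self, i, a):
        return self.p[a][bool(self.expr(i))]

    def reinforce(self, i, a):
        """Sample a Boolean reinforcement value for action a on input i."""
        return 1 if self.rng.random() < self.success_probability(i, a) else 0


def expected_optimal(world):
    """Exact expected reinforcement of always taking the better action."""
    inputs = list(product((0, 1), repeat=world.m))
    best = (max(world.success_probability(i, a) for a in (0, 1)) for i in inputs)
    return sum(best) / len(inputs)


def expected_random(world):
    """Exact expected reinforcement of choosing actions 0 and 1 with equal probability."""
    inputs = list(product((0, 1), repeat=world.m))
    avg = (0.5 * (world.success_probability(i, 0) + world.success_probability(i, 1))
           for i in inputs)
    return sum(avg) / len(inputs)
```

As a check on the definitions, a six-bit world whose expression is the conjunction of one positive and one negated bit, with probabilities .9, .5, .6, and .8 (the Task 7 parameters of Table 4), yields an expected reinforcement of .675 for random behavior and .825 for optimal behavior under this sketch.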

Table  4  shows  the  values  of  these  parameters  for  each  task.  The  first  task  has 
a  simple,  linearly  separable  function;  what  makes  it  difficult  is  the  small  separation 
between the reinforcement probabilities. Task 6 has highly differentiated
reinforcement probabilities, but the function to be learned is a complex exclusive-or. Finally,
Task  7  is  a  simple  conjunctive  function,  but  all  of  the  reinforcement  probabilities 
are  high  and  it  has  twice  as  many  input  bits  as  the  other  two  tasks. 
Task  M  expr                                           p_1s  p_1n  p_0s  p_0n
5     3  (i_0 ∧ i_1) ∨ (i_1 ∧ i_2)                      .6    .4    .4    .6
6     3  (i_0 ∧ ¬i_1) ∨ (i_1 ∧ ¬i_2) ∨ (i_2 ∧ ¬i_0)     .9    .1    .1    .9
7     6  i_2 ∧ ¬i_5                                     .9    .5    .6    .8

Table 4: Parameters of test environments for k-DNF experiments.


6.5.2  Parameter  Tuning 

Each of the algorithms has a set of parameters. For both IEKDNF and LARCKDNF,
k = 2. Algorithms LARC, LARC+, and LARCKDNF have parameters α, β, and
σ. Following Sutton [70], parameters β and σ in LARCKDNF, LARC, and LARC+
are fixed to have values .1 and .3, respectively. The IEKDNF algorithm has two
confidence-interval parameters, z_{α/2} and z_{β/2}, and a minimum age for the equality
test, β_min, while the IE algorithm has only z_{α/2}. Finally, the BP algorithm has a large
set of parameters: β, learning rate of the evaluation output units; β_h, learning rate
of the evaluation hidden units; ρ, learning rate of the action output units; and ρ_h,
learning rate of the action hidden units. All of the parameters for each algorithm
are chosen to optimize the behavior of that algorithm on the chosen task. The
success of an algorithm is measured by the average reinforcement received per tick,
averaged over the entire run.

For  each  algorithm  and  environment,  a  series  of  100  trials  of  length  3000  were 
run  with  different  parameter  values.  Table  5  shows  the  best  set  of  parameter  values 
found  for  each  algorithm-environment  pair. 


6.5.3  Results 

Using the best parameter values for each algorithm and environment, the
performance of the algorithms was compared on runs of length 3000. The performance
metric was average reinforcement per tick, averaged over the entire run. The
results are shown in Table 6, together with the expected reinforcement of executing a
completely  random  behavior  (choosing  actions  0  and  1  with  equal  probability)  and 
of  executing  the  optimal  behavior. 
ALG-TASK              1       2       3
LARC      α           .0625   .125    .125
LARC+     α           .125    .0625   .25
LARCKDNF  α           .125    .25     .03125
IEKDNF    z_{α/2}     3       3.5     2.5
          z_{β/2}     1       2.5     3.5
          β_min       15      5       25
BP        β           .1      .25     .1
          β_h         .2      .3      .05
          ρ           .15     .15     .35
          ρ_h         .2      .05     .1
IE        z_{α/2}     3.0     1.5     2.5

Table 5: Best parameter values for each k-DNF algorithm in each environment.


ALG-TASK    1       2       3
LARC        .5329   .7418   .7769
LARC+       .5456   .7459   .7722
LARCKDNF    .5783   .8903   .7825
IEKDNF      .5789   .8900   .7993
BP          .5456   .7406   .7852
IE          .5827   .8966   .7872
random      .5000   .5000   .6750
optimal     .6000   .9000   .8250

Table 6: Average reinforcement for k-DNF problems over 100 runs of length 3000.
As in the set of experiments described in Chapter 4, we must examine the
relationships  of  statistically  significant  dominance  among  the  algorithms  for  each 
task.  Figure  42  shows,  for  each  task,  a  pictorial  representation  of  the  results  of  a 
1-sided  t-test  applied  to  each  pair  of  experimental  results.  The  graphs  encode  a 
partial  order  of  significant  dominance,  with  solid  lines  representing  significance  at 
the  .95  level  and  dashed  lines  representing  significance  at  the  .85  level. 

With  the  best  parameter  values  for  each  algorithm,  it  is  also  instructive  to 
compare  the  rate  at  which  performance  improves  as  a  function  of  the  number  of 
training  instances.  Figures  43,  44,  and  45  show  superimposed  plots  of  the  learning 
curves  for  each  of  the  algorithms.  Each  point  represents  the  average  reinforcement 
received  over  a  sequence  of  100  steps,  averaged  over  100  runs  of  length  3000. 

6.5.4  Discussion 

On Tasks 5 and 6, the basic interval-estimation algorithm, IE, performed
significantly better than any of the other algorithms. The magnitude of its superiority,
however,  is  not  extremely  great — Figures  43  and  44  reveal  that  the  IEKDNF  and 
LARCKDNF  algorithms  have  similar  performance  characteristics  both  to  each  other 
and  to  IE.  On  these  two  tasks,  the  overall  performance  of  IEKDNF  and  LARCKDNF 
were  not  found  to  be  significantly  different. 

The backpropagation algorithm, BP, performed considerably worse than
expected on Tasks 5 and 6. It is very difficult to tune the parameters for this
algorithm, so its bad performance may be explained by a sub-optimal setting of
parameters.3 However, it is possible to see in the learning curves of Figures 43 and
44  that  the  performance  of  BP  was  still  increasing  at  the  ends  of  the  runs.  This  may 
indicate  that  with  more  training  instances  it  would  eventually  converge  to  optimal 
performance. 


3 In the parameter tuning phase, the parameters were varied independently; it may well be
necessary to perform gradient-ascent search in the parameter space, but that is a computationally
difficult task, especially when the evaluation of any point in parameter space may have a high degree
of noise.
Figure 42: Significant dominance partial order among k-DNF algorithms for each
task.

Figure 43: Learning curves for Task 5.

Figure 44: Learning curves for Task 6.

Figure 45: Learning curves for Task 7.

The linear-association algorithms performed poorly on both Tasks 5 and 6. This
poor performance was expected on Task 6, because such algorithms are known to be
unable to learn non-linearly-separable functions [47]. Task 5 is difficult for these
algorithms because, during the execution of the algorithm, the evaluation function is
often too complex to be learned by the simple linear associator. Adding a constant
input value to the LARC algorithm made a significant improvement in performance;
this is not surprising, because it allows the algorithm to find discrimination
hyperplanes that do not pass through the origin of the space.

Task  7  reveals  many  interesting  strengths  and  weaknesses  of  the  algorithms. 
One  of  the  most  interesting  is  that  IE  is  no  longer  the  best  performer.  Because 
the  target  function  is  simple  and  there  is  a  larger  number  of  input  bits,  the  ability 
to  generalize  across  input  instances  becomes  important.  The  IEKDNF  algorithm  is 
able  to  find  the  correct  hypothesis  early  during  the  run  (this  is  apparent  in  the 
learning  curve  of  Figure  45).  However,  because  the  reinforcement  values  are  not 
highly  differentiated  and  because  the  size  of  the  set  T  is  quite  large,  it  begins  to 
include  extraneous  terms  due  to  statistical  fluctuations  in  the  environment,  causing 
slightly  degraded  performance.  The  IE,  BP,  and  LARCKDNF  algorithms  all  have  very 
similar performance on Task 7, with the linear-associator algorithms performing
slightly  worse,  but  still  reasonably  well. 

6.6  Conclusion 

From this study, we can see that it is useful to design algorithms that are
tailored to learning certain restricted classes of functions. The two specially-designed
algorithms far out-performed standard methods of comparable complexity. The
LARCKDNF and IEKDNF algorithms each have their strengths and weaknesses. It is
possible that LARCKDNF may outperform IEKDNF to some extent because in
LARCKDNF each term gets to contribute to the answer with different degrees. This
avoids  errors  that  occur  in  IEKDNF  when  a  single  term  is  barely  over  the  threshold 
for  generating  a  1.  On  the  other  hand,  the  state  of  IEKDNF  has  internal  semantics 
that  are  clear  and  directly  interpretable  in  the  language  of  classical  statistics.  This 
simplifies  the  process  of  extending  the  algorithm  to  apply  to  other  types  of  worlds 
in  a  principled  manner. 


Chapter  7 

A  Generate-and-Test  Algorithm 


This  chapter  describes  GTRL,  a  highly  parametrized  generate-and-test  algorithm  for 
learning  Boolean  functions  from  reinforcement.  Some  parameter  settings  make  it 
highly time- and space-efficient, but allow it to learn only a restricted class of
functions; other parameter settings allow arbitrarily complex functions to be learned,
but  at  a  cost  in  time  and  space. 


7.1  Introduction 

The  generate-and-test  reinforcement-learning  algorithm,  GTRL,  performs  a  bounded, 
real-time  beam-search  in  the  space  of  Boolean  formulae,  searching  for  a  formula  that 
represents  an  action  function  that  exhibits  high  performance  in  the  environment. 
This  algorithm  adheres  to  the  strict  synchronous  tick  discipline  of  the  learning- 
behavior  formulation  of  Chapter  2,  performing  its  search  incrementally,  while  using 
the best available solution to generate actions for the inputs with which it is
presented.

The algorithm has, at any time, a set of hypotheses that it is considering. A
hypothesis has as its main component a Boolean formula whose atoms are input bits
or their negations. Negations can occur only at the lowest level in the formulae.1
Each formula represents a potential action-map for the behavior, generating action 1
whenever the current input satisfies the formula and action 0 when it does not. The
GTRL algorithm generates new hypotheses by combining the formulae of existing
hypotheses using syntactic conjunction and disjunction operators.2 This generation
of new hypotheses represents a search through Boolean-formula space; statistics
related to the performance of the hypotheses in the domain are used to guide the
search, choosing appropriate formulae to be combined.

1 Any Boolean formula can be put in this form using DeMorgan’s laws.

This  search  is  quite  constrained,  however.  There  is  a  limit  on  the  number  of 
hypotheses  with  formulae  at  each  level  of  Boolean  complexity  (depth  of  nesting  of 
Boolean  operators),  making  the  process  very  much  like  a  beam  search  in  which  the 
entire  beam  is  retained  in  memory.  As  time  passes,  old  elements  may  be  deleted 
from  and  new  elements  added  to  the  beam,  as  long  as  the  size  is  kept  constant. 
This  guarantees  that  the  algorithm  will  operate  in  constant  time  per  input  instance 
and that the space requirement will not grow without bound over time.3

This  search  method  is  inspired  by  Schlimmer’s  STAGGER  system  [65,66,64,63,62] 
for  learning  Boolean  functions  from  input-output  pairs.  STAGGER  makes  use  of  a 
number  of  techniques,  including  a  Bayesian  weight-updating  component,  that  are 
inappropriate for the reinforcement-learning problem. In addition, it is not strictly
limited  in  time-  or  space-complexity.  The  GTRL  algorithm  exploits  STAGGER’s  idea 
of  performing  incremental  search  in  the  space  of  Boolean  formulae,  using  statistical 
estimates  of  the  “necessity”  and  “sufficiency”  (these  notions  will  be  made  concrete 
in  the  following  discussion)  to  guide  the  search. 

The presentation of the GTRL algorithm will be independent of any
distributional assumptions about the reinforcement values generated by the environment;
it  will,  however,  assume  that  the  environment  is  consistent  (see  Section  2.1.2  for 
the  definition)  for  the  agent.  The  process  of  tailoring  the  algorithm  to  work  for 
particular  kinds  of  reinforcement  will  be  described  in  Section  7.3. 

2 Other choices of syntactic search operators are possible. Conjunction and disjunction are used
here  because  of  the  availability  of  good  heuristics  for  guiding  their  application.  These  heuristics  will 
be  discussed  in  Section  7.5.1. 

3 An  alternative  would  be  to  simply  limit  the  total  number  of  hypotheses,  without  sorting  them 
into  levels.  This  approach  would  give  added  flexibility,  but  would  also  cause  some  increase  in 
computational  complexity.  In  addition,  it  is  often  beneficial  to  retain  hypotheses  at  low  levels  of 
complexity  because  of  their  usefulness  as  building  blocks. 
As with other learning behaviors, we will view the GTRL algorithm in terms of
initial  state,  update  function,  and  evaluation  function,  as  shown  in  Figure  46.  The 
internal  state  of  the  GTRL  algorithm  consists  of  a  set  of  hypotheses  organized  into 
levels.  Along  with  a  Boolean  formula,  each  hypothesis  contains  a  set  of  statistics 
that  reflect  different  aspects  of  the  performance  of  the  formula  as  an  action  map  in 
the  domain.  Each  level  contains  hypotheses  whose  formulae  are  of  a  given  Boolean 
complexity.  Figure  47  shows  an  example  GTRL  internal  state.  Level  0  consists  of 
hypotheses  whose  formulae  are  individual  atoms  corresponding  to  the  input  bits  and 
to  their  negations,  as  well  as  the  hypotheses  whose  formulae  are  the  logical  constants 
true  and  false.4  Hypotheses  at  level  1  have  formulae  that  are  conjunctions  and 
disjunctions  of  the  formulae  of  the  hypotheses  at  level  0.  In  general,  the  hypotheses 
at  level  n  have  formulae  that  consist  of  conjunctions  or  disjunctions  of  two  formulae: 
one  from  level  n  —  1  and  one  from  any  level,  from  0  to  n  —  1.  The  hypotheses  at 
each  level  are  divided  into  working  and  candidate  hypotheses;  the  reasons  for  this 
distinction  will  be  made  clear  during  the  detailed  explanation  of  the  algorithm. 

The update function of the GTRL algorithm consists of two phases: first,
updating the statistics of the individual hypotheses and, second, adding and deleting
hypotheses. 

The evaluation function also works in two phases. The first is to find the
working hypothesis, at any level, that has the best performance at choosing actions.
The second is to test that hypothesis against the input to be evaluated: if the
hypothesis is satisfied, action 1 is generated; if it is not satisfied, action 0 is
generated.

The  following  sections  will  examine  these  processes  in  greater  detail. 


4It is necessary to include true and false in case either of those is the optimal hypothesis.
Hypotheses at higher levels are simplified, so even if $a \wedge \neg a$ or $a \vee \neg a$ were to be
constructed, it would not be retained.

Algorithm 15 (GTRL)

s0 = array[0..L] of
       record
         working-hypoths: array[0..H] of hypoth
         candidate-hypoths: array[0..C] of hypoth
       end

u(s, i, a, r) = update-hypotheses(s, i, a, r)
                for each level in s do begin
                  add-hypotheses(level, s)
                  promote-hypotheses(level)
                  prune-hypotheses(level)
                end

e(s, i) = h := best-predictor(s)
          if satisfies(i, h) then
            return 1
          else return 0

Figure  46:  High-level  description  of  the  GTRL  algorithm. 
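
The structure of Figure 46 can be sketched in Python as follows. This is only an
illustrative skeleton under stated assumptions, not the implementation used in the
experiments: the names `Level` and `GTRL`, and the callables passed to the
constructor, are introduced here for illustration, and the hypothesis-management
steps themselves are supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Level:
    """One level of the internal state: working and candidate hypotheses."""
    working: List[Any] = field(default_factory=list)
    candidates: List[Any] = field(default_factory=list)


class GTRL:
    """Skeleton of the initial state s0, update function u, and evaluation function e."""

    def __init__(self, num_levels, update_stats, add, promote, prune,
                 best_predictor, satisfies):
        # s0: levels 0..L (assumed layout; level 0 must be filled by the caller)
        self.levels = [Level() for _ in range(num_levels + 1)]
        self._update_stats = update_stats
        self._add = add
        self._promote = promote
        self._prune = prune
        self._best_predictor = best_predictor
        self._satisfies = satisfies

    def update(self, inp, action, reinforcement):
        """u(s, i, a, r): update statistics, then manage hypotheses level by level."""
        self._update_stats(self.levels, inp, action, reinforcement)
        for level in self.levels:
            self._add(level, self.levels)
            self._promote(level)
            self._prune(level)

    def evaluate(self, inp):
        """e(s, i): act according to the most predictive working hypothesis."""
        h = self._best_predictor(self.levels)
        return 1 if self._satisfies(inp, h) else 0
```

Passing the component procedures in as arguments mirrors the modular separation
in the text: the statistics module and the add, promote, and prune steps can each
be replaced without touching the control structure.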


Level 2:  $(a \vee b) \wedge (\neg b \vee \neg c)$    $(b \vee c) \wedge \neg a$    $(c \wedge \neg a) \vee (a \wedge \neg b)$

Level 1:  $a \vee b$    $b \vee c$    $c \wedge \neg a$    $\neg b \vee \neg c$    $a \wedge \neg b$

Level 0:  $a$    $\neg a$    $b$    $\neg b$    $c$    $\neg c$    t    f

Figure 47: Example GTRL internal state.


7.3 Statistics

Associated with each working and candidate hypothesis is a set of statistics; these
statistics are used to choose working hypotheses for generating actions and for
combination into new candidate hypotheses at higher levels. The algorithms for
updating the statistical information and computing statistical quantities are modularly
separated from the rest of the GTRL algorithm. The choice of statistical module
will  depend  on  the  kind  and  distribution  of  reinforcement  values  received  from  the 
environment.  Appendix  A  provides  the  detailed  definitions  of  statistics  modules 
for  cases  in  which  the  reinforcement  values  are  binomially  or  normally  distributed; 
in  addition,  it  contains  a  non-parametric  statistics  module  for  use  when  there  is 
no  known  model  of  the  distribution  of  reinforcement  values.  A  statistics  module 
supplies  the  following  functions: 

age(h):  The  number  of  times  the  behavior,  as  a  whole,  has  taken  the  action  that 
would  have  been  taken  had  hypothesis  h  been  used  to  generate  the  action. 

er(h):  A  point  estimate  of  the  expected  reinforcement  received  given  that  the  action 
taken  by  the  behavior  agrees  with  the  one  that  would  have  been  generated 
had  hypothesis  h  been  used  to  generate  the  action. 

er-ub(h): The upper bound of a 100(1 - α)% confidence interval estimate of the
quantity estimated by er(h).

erp(h): A point estimate of the expected reinforcement received given that hypothesis
h was used to generate the action that resulted in the reinforcement.

erp-ub(h): The upper bound of a 100(1 - α)% confidence interval estimate of the
quantity estimated by erp(h).

N(h): A point estimate of the expected reinforcement received given that the action
taken by the behavior was 0 and hypothesis h would have generated action 0
as well.

S(h): A point estimate of the expected reinforcement received given that the action
taken by the behavior was 1 and hypothesis h would have generated action 1
as well.
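
The interface above can be illustrated with a small Python sketch. It is not one
of the modules defined in Appendix A: the class names, the use of Welford's
running-variance method, and the normal-approximation upper bound (mean plus z
standard errors) are all assumptions made here for illustration.

```python
import math


class RunningMean:
    """Running mean and variance of reinforcement values (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def add(self, r):
        self.n += 1
        delta = r - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (r - self.mean)

    def upper_bound(self, z):
        """Normal-approximation upper bound; infinite until two samples exist."""
        if self.n < 2:
            return math.inf
        variance = self.m2 / (self.n - 1)
        return self.mean + z * math.sqrt(variance / self.n)


class HypothesisStats:
    """Statistics kept for one hypothesis h (assumed layout)."""

    def __init__(self, z=1.96):
        self.z = z
        self.agree = RunningMean()   # behavior's action matched h's action
        self.chosen = RunningMean()  # h itself generated the action
        self.agree0 = RunningMean()  # both chose action 0
        self.agree1 = RunningMean()  # both chose action 1

    def update(self, h_action, action, r, h_was_chosen):
        if h_action != action:
            return  # no statistic above is conditioned on disagreement
        self.agree.add(r)
        (self.agree1 if action == 1 else self.agree0).add(r)
        if h_was_chosen:
            self.chosen.add(r)

    def age(self): return self.agree.n
    def er(self): return self.agree.mean
    def er_ub(self): return self.agree.upper_bound(self.z)
    def erp(self): return self.chosen.mean
    def erp_ub(self): return self.chosen.upper_bound(self.z)
    def N(self): return self.agree0.mean
    def S(self): return self.agree1.mean
```

Note that `er`, `N`, and `S` are updated on every tick in which the hypothesis
agrees with the action taken, whereas `erp` is updated only when the hypothesis
was itself used to act.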


7.4  Evaluating  Inputs 

Each  time  the  evaluation  function  is  called,  the  most  predictive  working  hypothesis 
is  chosen,  by  taking  the  one  with  the  highest  value  of  pv,  defined  as 

$$pv(h) = \lfloor \kappa \, er(h) \rfloor + \text{erp-ub}(h) .$$

This definition has the effect of sorting first on the criterion of er, then breaking ties
based on the value of erp-ub. The constant multiplier κ can be adjusted to make
this criterion more or less sensitive to low-order digits of the value of er(h).5
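
As a rough illustration of this criterion, the following Python sketch computes pv
and selects the best hypothesis. The dictionary layout and the function names are
assumptions introduced here. The tie-breaking reading also relies on erp-ub being
smaller than 1, which is an assumption about how reinforcement is scaled.

```python
import math


def prediction_value(er, erp_ub, kappa=1000):
    """pv(h) = floor(kappa * er(h)) + erp-ub(h)."""
    return math.floor(kappa * er) + erp_ub


def best_predictor(hypotheses, kappa=1000):
    """Return the name of the working hypothesis with the highest pv.

    `hypotheses` maps a name to an (er, erp_ub) pair (an assumed layout).
    """
    return max(hypotheses, key=lambda h: prediction_value(*hypotheses[h], kappa))


# Hypothetical values: "a" and "b" differ in er only in the fourth decimal place.
example = {"a": (0.7004, 0.90), "b": (0.7001, 0.95), "c": (0.65, 1.0)}
```

With κ = 1000 the two leading hypotheses in `example` tie on the floored term, so
the larger erp-ub decides; with κ = 10000 the difference in er is visible and
decides instead.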

What  makes  this  an  appropriate  criterion  for  choosing  the  hypothesis  with  the 
best  performance?  The  quantity  that  most  clearly  represents  the  predictive  value 
of the hypothesis is erp(h), which is a point estimate of the expected reinforcement
given that actions are chosen according to hypothesis h. Unfortunately, this
quantity  only  has  a  useful  value  after  the  hypothesis  has  been  chosen  to  generate 
actions  a  number  of  times.  Thus,  as  in  the  interval  estimation  algorithm,  we  make 
use  of  erp-ub(h),  the  upper  bound  of  a  confidence  interval  estimate  of  the  expected 
reinforcement  of  acting  according  to  hypothesis  h. 

So,  why  not  simply  choose  the  working  hypothesis  with  the  highest  value  of 
erp-ub(h),  similar  to  what  would  be  done  in  the  interval  estimation  algorithm?  The 
reason is that, in the GTRL algorithm, new hypotheses are continually
being created. If the algorithm always chose the hypothesis with the highest value
of erp-ub(h), it would be in danger of spending nearly all of its time choosing
hypotheses because little is known about them, rather than because they are known
to perform well. The value of er(h) serves as a filter on hypotheses that prevents
most of this fruitless exploration. The quantity er(h) is not a completely accurate
estimator of erp(h), because the distribution of instances over which it is defined
may differ from the distribution of input instances presented to the entire
algorithm,6 but it serves as a useful approximation. We can use er(h) rather than
er-ub(h) because the statistics used to compute er(h) are updated even when h is
not used to generate actions, so that statistic eventually becomes valid without any
special work. Thus, hypotheses that look good on the basis of the value of er(h)
tend to get chosen to act; as they do, the value of erp-ub(h) begins to reflect their
true predictive value. This method still spends some time acting according to untested
hypotheses, but that is necessary in order to allow the algorithm to discover the
correct hypothesis initially and to adjust to a dynamically changing world. The
amount of exploration that actually takes place can be controlled by changing the
rate at which new hypotheses will be generated, as will be discussed in Section 7.7.


5In all of the experiments described in this chapter, κ had the value 1000.

Once a working hypothesis is chosen, it is used to evaluate the input instance.
An input vector i satisfies hypothesis h if h's formula evaluates to true under the
valuation of the atoms supplied by input i. If the input instance satisfies the chosen
hypothesis, action 1 is generated; otherwise, action 0 is generated.
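
The satisfaction test can be sketched in Python as a recursive evaluation. The
nested-tuple representation of formulae used here is an assumption made for
illustration; the text does not specify a data structure.

```python
def satisfies(inp, formula):
    """Return True if the bit vector `inp` satisfies `formula`.

    Assumed representation: ("const", b), ("atom", j), ("not", j) for a
    negated input bit, and ("and", f, g) or ("or", f, g) for binary operators.
    """
    op = formula[0]
    if op == "const":
        return formula[1]
    if op == "atom":
        return bool(inp[formula[1]])
    if op == "not":
        return not inp[formula[1]]
    if op == "and":
        return satisfies(inp, formula[1]) and satisfies(inp, formula[2])
    if op == "or":
        return satisfies(inp, formula[1]) or satisfies(inp, formula[2])
    raise ValueError(f"unknown operator: {op!r}")


def evaluate(inp, formula):
    """Action 1 if the chosen hypothesis is satisfied, otherwise action 0."""
    return 1 if satisfies(inp, formula) else 0


# Example formula: exclusive-or of input bits 0 and 1.
xor = ("or", ("and", ("atom", 0), ("not", 1)), ("and", ("not", 0), ("atom", 1)))
```

The running time is proportional to the size of the formula, since each
subformula is visited at most once.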


7.5  Managing  Hypotheses 


The process by which hypotheses are managed in the GTRL algorithm can be divided
into three parts: adding, promoting, and pruning. On each call to the update
function,  the  statistics  of  all  working  and  candidate  hypotheses  are  updated.  Then, 
if  it  is  time  to  do  so,  a  new  hypothesis  may  be  constructed  and  added  to  the 
candidate  list  of  some  level.  Candidate  hypotheses  that  satisfy  the  appropriate 
requirements  are  “promoted”  to  be  working  hypotheses.  Finally,  any  level  that 
has  more  working  hypotheses  than  the  constant  number  allotted  to  it  will  have  its 
working  hypothesis  list  pruned. 


6This difference in distributions depends on the fact that er(h) is conditioned on the agreement
between hypothesis h and whatever hypotheses are actually being used to generate actions.


7.5.1 Adding Hypotheses

Search  in  the  GTRL  algorithm  is  carried  out  through  the  addition  of  hypotheses. 
Each  new  hypothesis  is  a  conjunction  or  disjunction  of  hypotheses  from  lower  levels.7 
On  each  update  cycle,  a  candidate  hypothesis  is  added  to  a  level  if  the  level  is  not 
yet  fully  populated  (the  total  number  of  working  and  candidate  hypotheses  is  less 
than  the  maximum  number  of  working  hypotheses)  or  if  it  has  been  a  certain  length 
of  time  since  a  candidate  hypothesis  was  last  generated  for  this  level  and  there  is 
room  for  a  new  candidate. 
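
The condition for adding a candidate can be written as a small predicate. In this
Python sketch the function and argument names are assumptions, and "room for a
new candidate" is read as the candidate list holding fewer than its maximum number
of entries.

```python
def should_add_candidate(num_working, num_candidates, max_working,
                         max_candidates, ticks_since_last, rate):
    """Decide whether to generate a new candidate hypothesis for a level."""
    # The level is not yet fully populated.
    not_full = num_working + num_candidates < max_working
    # Enough time has passed and the candidate list has room (assumed reading).
    due = ticks_since_last >= rate and num_candidates < max_candidates
    return not_full or due
```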

If  it  is  time  to  generate  a  new  hypothesis,  it  is  randomly  decided  whether  to 
make  a  conjunctive  or  disjunctive  hypothesis.8  Once  the  combining  operator  is 
determined,  operands  must  be  chosen. 

The  following  search  heuristic  is  used  to  guide  the  selection  of  operands: 

When  making  a  conjunction,  use  operands  that  have  a  high  value  of 
necessity;  when  making  a  disjunction,  use  operands  that  have  a  high 
value  of  sufficiency. 


The terms necessity and sufficiency have a standard logical interpretation: P is
sufficient for Q if P implies Q; P is necessary for Q if $\neg P$ implies $\neg Q$ (that is, Q
implies P). Schlimmer follows Duda, Hart, and Nilsson [19,20], defining the logical
sufficiency of evidence E for hypothesis H as

$$LS(E, H) = \frac{\Pr(E \mid H)}{\Pr(E \mid \neg H)}$$

and the logical necessity of E for H as

$$LN(E, H) = \frac{\Pr(\neg E \mid H)}{\Pr(\neg E \mid \neg H)} .$$


7Terminology  is  being  abused  here  in  order  to  simplify  the  presentation.  Rather  than  conjoining 
hypotheses,  the  algorithm  actually  creates  a  new  hypothesis  whose  formula  is  the  conjunction  of  the 
formulae  of  the  operand  hypotheses.  This  use  of  terminology  should  not  cause  any  confusion. 

8Schlimmer’s  STAGGER  system  generates  new  hypotheses  in  response  to  errors,  using  the  nature 
of the error (false positive vs. false negative) to determine whether the new hypothesis should be a
conjunction or a disjunction. This method cannot be applied in the general reinforcement-learning
scenario, in which the algorithm is never told what the “correct” answer is, making it unable to
know whether or not it just made an “error.”

If E is truly logically sufficient for H, then E implies H, so $\Pr(E \mid \neg H) = 0$, making
$LS(E, H) = \infty$. If E and H are statistically independent, then $LS(E, H) = 1$.
Similarly, if E is logically necessary for H, then $\neg E$ implies $\neg H$, so $\Pr(\neg E \mid H) = 0$,
making $LN(E, H) = 0$. As before, if E and H are independent, $LN(E, H) = 1$.
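
These limiting cases can be checked numerically. The Python sketch below computes
LS and LN by enumeration, assuming a uniform distribution over two input bits; the
helper names and the choice of distribution are assumptions made for the example.

```python
import math
from itertools import product


def conditional(event, given, inputs):
    """Pr(event | given) under a uniform distribution over `inputs`."""
    matching = [x for x in inputs if given(x)]
    if not matching:
        return math.nan
    return sum(1 for x in matching if event(x)) / len(matching)


def ratio(num, den):
    """num / den, mapping a zero denominator to infinity (or NaN for 0/0)."""
    if den == 0:
        return math.inf if num > 0 else math.nan
    return num / den


def logical_sufficiency(e, h, inputs):
    """LS(E, H) = Pr(E | H) / Pr(E | not H)."""
    return ratio(conditional(e, h, inputs),
                 conditional(e, lambda x: not h(x), inputs))


def logical_necessity(e, h, inputs):
    """LN(E, H) = Pr(not E | H) / Pr(not E | not H)."""
    not_e = lambda x: not e(x)
    return ratio(conditional(not_e, h, inputs),
                 conditional(not_e, lambda x: not h(x), inputs))


inputs = list(product([0, 1], repeat=2))
a = lambda x: x[0] == 1
b = lambda x: x[1] == 1
a_or_b = lambda x: x[0] == 1 or x[1] == 1
a_and_b = lambda x: x[0] == 1 and x[1] == 1
```

Here `a` is sufficient for `a_or_b`, so its LS is infinite; `a` is necessary for
`a_and_b`, so its LN is zero; and `a` and `b` are independent, so both measures
equal 1.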

What  makes  functions  like  these  useful  for  our  purposes  is  that  they  encode  the 
notions  of  “degree  of  implication”  and  “degree  of  implication  by.”9  Let  h*(i)  be  the 
optimal  hypothesis,  defined  by 

$$\forall i .\; h^*(i) \leftrightarrow Opt(i, 1) ,$$

where  Opt  is  defined  as  in  Chapter  2.  We  would  like  to  use  these  same  notions  of 
necessity  and  sufficiency  to  guide  our  search,  estimating  the  necessity  and  sufficiency 
of  hypotheses  in  the  GTRL  algorithm  state  for  h* ,  the  Boolean  function  that  encodes 
the  optimal  action  policy  for  the  environment.  But,  because  of  the  reinforcement- 
learning  setting  of  our  problem,  we  have  no  access  to  or  direct  information  about 
h* — the  environment  never  tells  the  agent  which  action  it  should  have  taken. 

If we define the sufficiency of hypothesis h for the optimal policy, S(h), as

$$S(h) = er(i, 1 \mid \mathit{satisfies}(i, h)) ,$$

we have a function with the desired properties. If h implies h*, then

$$S(h) = er(i, 1 \mid \mathit{satisfies}(i, h^*)) ,$$

which is the best that can be done on this set of inputs, because whenever action 1
would be taken by h, it would also be taken by h*. In all other cases, S(h) < S(h*),
with S(h) roughly encoding the degree to which h implies h*. If h and h* are
completely uncorrelated, S(h) is the expected reinforcement of acting according to
a random policy. Similarly, we define the necessity of a hypothesis h for the optimal
policy, N(h), as

$$N(h) = er(i, 0 \mid \neg\mathit{satisfies}(i, h)) .$$


If $\neg h$ implies $\neg h^*$, then

$$N(h) = er(i, 0 \mid \neg\mathit{satisfies}(i, h^*)) ,$$

because whenever action 0 would be taken by h it would be taken by h*. In all other
cases, N(h) < N(h*), with N roughly encoding the degree to which h is implied by
h*.


9The LS and LN functions were designed for combining evidence in a human-intuitive way; their
quantitative properties are crucial to their correctness and usefulness for this purpose. The S and N
operators that will be proposed do not have the appropriate quantitative properties for such uses.

Now  we  understand  the  definition  and  purpose  of  the  necessity  and  sufficiency 
operators,  but  what  makes  them  appropriate  for  use  as  search-control  heuristics?  In 
general,  if  we  have  a  hypothesis  that  is  highly  sufficient,  it  can  be  best  improved  by 
making  it  highly  necessary  as  well;  this  can  be  achieved  by  making  the  hypothesis 
more  general  by  disjoining  it  with  another  sufficient  hypothesis.  Similarly,  given  a 
highly  necessary  hypothesis,  we  would  like  to  make  it  more  sufficient;  we  can  achieve 
this  through  specialization  by  conjoining  it  with  another  necessary  hypothesis.  As 
a simple example, consider the case in which $h^* = a \vee b$. In this case, the hypothesis
a is logically sufficient for h*, so the heuristic will have us try to improve it by
disjoining it with another sufficient hypothesis. If $h^* = a \wedge b$, the hypothesis a is
logically necessary for h*, so the heuristic would give preference to conjoining it
with another necessary hypothesis.

Having  decided,  for  instance,  to  create  a  new  disjunctive  hypothesis  at  level  n, 
the  algorithm  uses  sufficiency  as  a  criterion  for  choosing  operands.  This  is  done  by 
creating  two  sorted  lists  of  hypotheses:  the  first  list  consists  of  the  hypotheses  of 
level n - 1, sorted from highest to lowest sufficiency; the second list contains all of
the hypotheses from levels 0 to n - 1, also sorted by sufficiency. The first list is
limited  in  order  to  allow  complete  coverage  of  the  search  space  without  duplication 
of  hypotheses  at  different  levels.  Thus,  for  example,  a  hypothesis  of  depth  2  can  be 
constructed  at  level  2,  but  one  of  depth  1  cannot. 

Given  the  two  sorted  lists  (another  sorting  criterion  could  easily  be  substituted 
for  necessity  or  sufficiency  at  this  point),  a  new  disjunctive  hypothesis  is  constructed 
by  syntactically  disjoining  the  formulae  associated  with  the  hypotheses  at  the  top 
of  each  list.  This  new  formula  is  then  simplified  and  put  into  a  canonical  form.10 


10The choice of canonicalization and simplification procedures represents a tradeoff between
computation time and space used in canonicalization against the likelihood that duplicate hypotheses
will not be detected. Any process for putting Boolean formulae into a normal form that reduces
semantic equivalence to syntactic equivalence has exponential worst-case time and space complexity
in the original size of the formula. The GTRL algorithm currently uses a very simple simplification
process whose complexity is linear in the original size of the formula and that seems, empirically, to
work well. This simplification process is described in detail in Appendix B.

If the simplified formula is of depth less than n, it is discarded, because if it is
important, it will occur at a lower level and we wish to avoid duplication. If it is
of depth n, it is tested for syntactic equality against all other hypotheses at level
n. If the hypothesis is not a syntactic duplicate, it is added to the candidate list
of level n and its statistics are initialized. If the new hypothesis is too simple or
is a duplicate, two new indices into the sorted lists are chosen and the process is
repeated. The new indices are chosen so that the algorithm finds the non-duplicate
disjunction made from a pair of hypotheses whose sum of indices is least. This is
achieved by the code shown in Figure 48. The complexity of this process can be
controlled by limiting the total number of new hypotheses that can be tried before
giving up. In addition, given such a limit, it is possible to generate only prefixes
of the sorted operand-lists that are long enough to support the desired number of
attempts.

index-1 := 0
index-2 := 0
index-sum := 0
loop
  try-hypoth(list-1[index-1], list-2[index-2]);
  index-1 := index-1 + 1;
  index-2 := index-2 - 1;
  if index-2 = -1 then begin
    index-sum := index-sum + 1
    index-1 := 0
    index-2 := index-sum
  end
end

Figure 48: Code to generate the best new hypothesis.
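
The search order of Figure 48 can also be expressed as a Python generator. This is
a sketch rather than the original code: the function names, the explicit bounds
checks, and the `max_tries` cutoff (standing in for the limit on attempts
described above) are assumptions introduced here.

```python
def operand_pairs(len_1, len_2):
    """Yield index pairs (i, j) in order of increasing i + j.

    Unlike the loop in Figure 48, this version skips indices that fall
    outside either list and stops once every pair has been produced.
    """
    for index_sum in range(len_1 + len_2 - 1):
        for i in range(index_sum + 1):
            j = index_sum - i
            if i < len_1 and j < len_2:
                yield i, j


def first_new_hypothesis(list_1, list_2, is_acceptable, max_tries):
    """Return the first acceptable operand pair, trying at most `max_tries` pairs."""
    for tries, (i, j) in enumerate(operand_pairs(len(list_1), len(list_2))):
        if tries >= max_tries:
            break
        if is_acceptable(list_1[i], list_2[j]):
            return list_1[i], list_2[j]
    return None
```

Because both lists are sorted from best to worst, visiting pairs by increasing
index sum tries the most promising combinations first.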

7.5.2 Promoting Hypotheses

On  each  update  phase,  the  candidate  hypotheses  are  considered  for  promotion.  The 
reason  for  dividing  the  candidate  hypotheses  from  the  working  hypotheses  is  to  be 
sure that they have gathered enough statistics for their values of N, S, and er to
be fairly accurate before they enter the pool from which operands and the
action-generating hypothesis are chosen. Thus, the criterion for promotion is simply the
age  of  the  hypothesis,  which  reflects  the  accuracy  of  its  statistics.  Any  candidate 
that  is  old  enough  is  moved,  on  this  phase,  to  the  working  hypothesis  list. 
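
The promotion step amounts to a single pass over the candidate list, as in the
following Python sketch. The function signature is an assumption: `age` is any
function giving the age statistic of a hypothesis, and `promotion_age` is the
threshold.

```python
def promote(working, candidates, age, promotion_age):
    """Move every sufficiently old candidate onto the working list, in place."""
    still_candidates = []
    for h in candidates:
        (working if age(h) >= promotion_age else still_candidates).append(h)
    candidates[:] = still_candidates
```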


7.5.3  Pruning  Hypotheses 

After candidates have been promoted, the total number of working hypotheses in a
level may exceed the preset limit. If this happens, the working hypothesis list for the
level is pruned. A hypothesis can play an important role in the GTRL algorithm for
three  reasons:  its  prediction  value  is  high,  making  it  useful  for  choosing  actions;  its 
sufficiency  is  high,  making  it  useful  for  combining  into  disjunctions;  or  its  necessity 
is  high,  making  it  useful  for  combining  into  conjunctions.  For  these  reasons,  we 
adopt  the  following  pruning  strategy: 


To prune down to n hypotheses, first choose the n/3 hypotheses with the
highest predictive value; of the remaining hypotheses, choose the n/3 with
the highest necessity; and, finally, of the remaining hypotheses, choose
the n/3 with the highest sufficiency.

This pruning criterion is applied to all but the bottom-most and top-most levels.
Level 0, which contains the atomic hypotheses and their negations, must never be
pruned, or the capability of generating the whole space of fixed-size Boolean
formulae will be lost. Because its hypotheses will not undergo further recombination,
the top level is pruned so as to retain the n most predictive hypotheses.
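
The pruning strategy can be sketched in Python as three successive selections.
The function names and the use of scoring functions as arguments are assumptions;
in particular, the text does not say how to treat a limit n that is not divisible
by 3, and this sketch simply rounds each share down.

```python
def prune(hypotheses, n, pv, necessity, sufficiency):
    """Keep about n hypotheses: n/3 by pv, then n/3 by N, then n/3 by S.

    `pv`, `necessity`, and `sufficiency` map a hypothesis to a number.
    When n is not divisible by 3, fewer than n hypotheses are kept.
    """
    share = n // 3
    kept, remaining = [], list(hypotheses)
    for score in (pv, necessity, sufficiency):
        remaining.sort(key=score, reverse=True)
        kept.extend(remaining[:share])
        remaining = remaining[share:]
    return kept


def prune_top_level(hypotheses, n, pv):
    """The top level keeps only the n most predictive hypotheses."""
    return sorted(hypotheses, key=pv, reverse=True)[:n]
```

Selecting each third only from the hypotheses not already kept ensures that one
hypothesis scoring well on all three criteria does not crowd out others.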


7.6 Parameters of the Algorithm

The  GTRL  algorithm  is  highly  configurable,  with  its  complexity  and  learning  ability 

controlled  by  the  following  parameters: 

L: The number of levels of hypotheses.

$z_{\alpha/2}$: The size of the confidence interval used to generate erp-ub.

H(l): The maximum number of working hypotheses per level; can be a function of
level number, l.

C(l): The maximum number of candidate hypotheses per level; can be a function
of level number, l.

PA: The age at which candidate hypotheses are promoted to be working hypotheses.

R: The rate at which new hypotheses are generated; every R ticks, for each level, l,
if there are not more than C(l) candidate hypotheses, a new one is generated.

T: The maximum number of new hypotheses that are tried, in a tick, to find a
non-duplicate hypothesis.

M: The number of input bits.

Because level 0 is fixed, we have H(0) = 2M + 2.


7.7  Computational  Complexity 

The space complexity of the GTRL algorithm is

$$O\left(\sum_{j=0}^{L} (H(j) + C(j))\, 2^{j}\right) ;$$

for each level j of the L levels, there are H(j) + C(j) working and candidate
hypotheses, each of which has size at most $2^j$ for the Boolean expression, plus a
constant amount of space for storing the statistics associated with the hypothesis.
This expression can be simplified, if H and C are independent of level, to

$$O(L(H + C)(2^{L+1} - 1)) ,$$

which is

$$O(L(H + C)\, 2^{L}) .$$


The time complexity for the evaluation function is

$$O\left(\sum_{j=0}^{L} H(j) + 2^{L}\right) ;$$

the first term accounts for spending a constant amount of time examining each
working hypothesis to see which one has the highest predictive value. Once the most
predictive working hypothesis is chosen, it must be tested for satisfaction by the
input instance; this process takes time proportional to the size of the expression, the
maximum possible value of which is $2^L$. If H is independent of level, this simplifies
to

$$O(LH + 2^{L}) .$$


The  expression  for  computation  time  of  the  update  function  is  considerably  more 
complex.  It  is  the  sum  of  the  time  taken  to  update  the  statistics  of  all  the  working 
and  candidate  hypotheses  plus,  for  each  level,  the  time  to  add  hypotheses,  promote 
hypotheses,  and  prune  hypotheses  for  the  level. 

The time to update the hypotheses is the sum of the times to update the individual
hypotheses. The update phase requires that each hypothesis be tested to see
if it is satisfied by the input. This testing requires time proportional to the size of
the hypothesis. Thus we have a time complexity of

$$O\left(\sum_{j=0}^{L} (H(j) + C(j))\, 2^{j}\right) ,$$

which simplifies to

$$O(L(H + C)\, 2^{L}) .$$

The time to add hypotheses consists of the time to create the two sorted lists
(assumed to be done in n log n time in the length of the list) plus the number of new
hypotheses tried times the amount of time to construct and test a new hypothesis
for duplication. This time is, for level j,

$$O\left(H(j-1) \log H(j-1) + \Big(\sum_{k=0}^{j-1} H(k)\Big) \log\Big(\sum_{k=0}^{j-1} H(k)\Big) + T\, 2^{j} (H(j) + C(j))\right) .$$

The  last  term  is  the  time  for  testing  new  hypotheses  against  old  ones  at  the  same 
level  to  be  sure  there  are  no  duplicates.  Testing  for  syntactic  equality  takes  time 
proportional  to  the  size  of  the  hypothesis  and  must  be  done  against  all  working  and 
candidate  hypotheses  in  level  j.  There  is  no  explicit  term  for  simplification  of  newly 
created  hypotheses  because  GTRL  uses  a  procedure  that  is  linear  in  the  size  of  the 
hypothesis. 

The time to promote hypotheses is simply proportional to the number of candidates,
C(j).

Finally, the time to prune hypotheses is 3 times the time to choose the H(j)/3
best hypotheses which, for the purpose of developing upper bounds, is H(j) log H(j).

Summing  these  expressions  for  each  level  and  making  the  simplifying  assumption 
that  H  and  C  do  not  vary  with  level  yields  a  time  complexity  of 

$$O(L(H \log H + LH \log(LH) + T\, 2^{L} (H + C) + C + H \log H)) ,$$

which can be further simplified to

$$O(L^{2} H \log(LH) + T\, 2^{L} L (H + C)) . \qquad (10)$$

The time complexity of the statistical update component, $O(L(H + C) 2^L)$, is
dominated by the second term above, making expression 10 above the time complexity of
the entire update function. This is the complexity of the longest possible tick. The
addition and pruning of hypotheses, which are the most time-consuming steps, will
happen only once every R ticks. Taking this into account, we get a kind of “average
worst-case” total complexity (the average is guaranteed when taken over a number
of ticks, rather than being a kind of expected complexity based on assumptions
about the distribution of inputs) of

$$O\left(L(H + C)\, 2^{L} + \frac{1}{R} L^{2} H \log(LH) + \frac{T}{R}\, 2^{L} L (H + C)\right) .$$

The complexity in the individual parameters is $O(2^L)$, $O(H \log H)$, $O(1/R)$, $O(T)$,
$O(C)$. Clearly, the number of levels and the number of hypotheses per level have
the greatest effect on total algorithmic complexity.11


7.8  Choosing  Parameter  Values 

This  section  will  explore  the  relationship  between  the  settings  of  parameter  values 
and  the  learning  abilities  of  the  GTRL  algorithm. 

7.8.1  Number  of  Levels 

Any  Boolean  function  can  be  written  with  a  wide  variety  of  syntactic  expressions. 
Consider  the  set  of  Boolean  formulae  with  the  negations  driven  in  as  far  as  possible, 
using  DeMorgan’s  laws.  The  depth  of  such  a  formula  is  the  maximum  nesting  depth 
of  binary  conjunction  and  disjunction  operators  within  the  formula.  The  depth  of  a 
Boolean  function  is  defined  to  be  the  depth  of  the  shallowest  Boolean  formula  that 
expresses  the  function. 
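
The depth of a single formula can be computed recursively, as in this Python
sketch. The nested-tuple representation is an assumption made for illustration,
and negations are taken to apply only to atoms, as in the definition above. Note
that the function measures the depth of one formula; the depth of a Boolean
function is the minimum of this quantity over all formulae that express it.

```python
def depth(formula):
    """Maximum nesting depth of binary "and"/"or" operators in `formula`.

    Assumed representation: ("const", b), ("atom", j), ("not", j) for a
    negated input bit, and ("and", f, g) or ("or", f, g) for binary operators.
    """
    if formula[0] in ("and", "or"):
        return 1 + max(depth(formula[1]), depth(formula[2]))
    return 0  # constants, atoms, and negated atoms


# Exclusive-or of bits 0 and 1, written as a disjunction of two conjunctions.
xor = ("or", ("and", ("atom", 0), ("not", 1)), ("and", ("not", 0), ("atom", 1)))
```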

An  instance  of  the  GTRL  algorithm  with  L  levels  of  combination  is  unable  to 
learn  functions  with  depth  greater  than  L.  Whether  it  can  learn  all  functions  of 
depth  L  or  less  depends  on  the  settings  of  other  parameters  in  the  algorithm.  The 
time  and  space  complexities  of  the  algorithm  are,  technically,  most  sensitive  to  this 
parameter,  both  being  exponential  in  the  number  of  levels.  However,  in  practical 
applications of this algorithm, H is usually considerably larger than $2^L$.

7.8.2  Number  of  Working  and  Candidate  Hypotheses 

The  choice  of  the  size  of  the  hypothesis  lists  at  each  level  also  has  a  great  effect 
on  the  overall  complexity  of  the  algorithm.  The  working  hypothesis  list  needs 
to  be  at  least  big  enough  to  hold  all  of  the  subexpressions  of  some  formula  that 
describes the target function. Thus, in order to learn the function described by
$i_0 \wedge (i_1 \vee i_2) \wedge (i_3 \vee \neg i_4)$, level 1 must have room for at least two working hypotheses,
$i_1 \vee i_2$ and $i_3 \vee \neg i_4$, and levels 2 and 3 must have room for at least one working
hypothesis each.

11This complexity is not as bad as it may look, because $2^L$ is just the length of the longest formula
that can be constructed by the algorithm. The time and space complexities are linear in this length.

This  amount  of  space  will  rarely  be  sufficient,  however.  There  must  also  be  room 
for  newly  generated  hypotheses  to  stay  until  they  are  tested  and  proven  or  disproven 
by  their  performance  in  the  environment.  Exactly  how  much  room  this  is  depends 
on the rate, R, at which new hypotheses are generated and on the size, $z_{\alpha/2}$, of the
confidence intervals used to generate erp-ub. To see this, consider the case in which
a representation of the optimal hypothesis, h*, has already been constructed. The
algorithm  continues  to  generate  new  hypotheses,  one  every  R  ticks,  with  each  new 
hypothesis requiring an average of j ticks to be proven to be worse than h*. That
means there must be an average of j/R slots for extra hypotheses at this level. Of
course,  it  is  likely  that  during  the  course  of  a  run,  certain  non-optimal  hypotheses 
will take more than j ticks to disprove. This can cause h* to be driven out of
the  hypothesis  list  altogether  during  the  pruning  phase.  Thus,  a  more  conservative 
strategy  is  to  prevent  this  by  increasing  the  size  of  the  hypothesis  lists,  but  at  a 
penalty  in  computation  time. 

Even  when  there  is  enough  space  for  all  subexpressions  and  their  competitors  at 
each  level,  it  is  possible  for  the  size  of  the  hypothesis  lists  to  affect  the  speed  at  which 
the  optimal  hypothesis  is  generated  by  the  algorithm.  This  can  be  easily  understood 
in  the  context  of  the  difficulty  of  a  function  for  the  algorithm.  Intuitively,  functions 
whose  subexpressions  are  not  naturally  preferred  by  the  necessity  and  sufficiency 
search  heuristics  are  difficult  for  the  GTRL  algorithm  to  construct.  In  such  cases, 
the  algorithm  is  reduced  to  randomly  choosing  expressions  at  each  level. 

Consider the case in which $h^* = (i_0 \wedge \neg i_1) \vee (\neg i_0 \wedge i_1)$, an exclusive-or function.
Because h* neither implies nor is implied by any of the input bits, the atoms will
all have similar, average values of N and S. Due to random fluctuations in the
environment, different atoms will have higher values of N and S at different times
during a run. Thus, the conjunctions and disjunctions at level 1 will represent a
sort of random search through expression space. This random search will eventually
generate one of the following expressions: $i_0 \wedge \neg i_1$, $\neg i_0 \wedge i_1$, $i_0 \vee i_1$, $\neg i_0 \vee \neg i_1$. When
one of these is generated, it will be retained in the level 1 hypothesis list because of
its high necessity or sufficiency. We need only wait until the random combination
process generates its companion subexpression, and they will be combined into a
representation of h* at level 2.

Even  with  very  small  hypothesis  lists,  the  correct  answer  will  eventually  be 
generated.  However,  as  problems  become  more  difficult,  the  probability  that  the 
random  process  will,  on  any  given  tick,  generate  the  appropriate  operands  becomes 
very  small,  making  the  algorithm  arbitrarily  slow  to  converge  to  the  correct  answer. 
This  process  can  be  made  to  take  fewer  ticks  by  increasing  the  size  of  the  hypothesis 
list.  In  the  limit,  the  hypothesis  list  will  be  large  enough  to  hold  all  conjunctions 
and  disjunctions  of  atoms  at  the  previous  level  and  as  soon  as  it  is  filled,  the  correct 
building  blocks  for  the  next  level  will  be  available  and  apparent. 

We  can  measure  the  overall  difficulty  of  a  function  for  the  GTRL  algorithm  in 
the  context  of  a  particular  distribution  of  input  instances  by  measuring  the  degree 
to  which  the  individual  input  bits  are  necessary  or  sufficient  for  the  function.  We 
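can define the difficulty of function $f$, $D(f)$, as

$$D(f) = \sum_{j<M} \left( \min\left(\frac{1}{\mathrm{LS}(i_j, f)}, \frac{1}{\mathrm{LS}(\neg i_j, f)}\right) + \min\bigl(\mathrm{LN}(i_j, f), \mathrm{LN}(\neg i_j, f)\bigr) \right) .$$

For each positive atom, the lack of sufficiency or necessity makes the problem more
difficult; the term $\min(1/\mathrm{LS}(i_j, f), 1/\mathrm{LS}(\neg i_j, f))$ measures the degree to which the atom
and its negation are insufficient; the term $\min(\mathrm{LN}(i_j, f), \mathrm{LN}(\neg i_j, f))$ measures the
degree to which the atom and its negation are unnecessary (recall that high values
of LS indicate sufficiency and low values of LN indicate necessity). Given that
$\mathrm{LS}(a, b) = \mathrm{LN}(\neg a, b)$, we can simplify the definition to

$$D(f) = \sum_{j<M} \left( \min\left(\frac{1}{\mathrm{LS}(i_j, f)}, \frac{1}{\mathrm{LS}(\neg i_j, f)}\right) + \min\bigl(\mathrm{LS}(i_j, f), \mathrm{LS}(\neg i_j, f)\bigr) \right) .$$

In this form, the difficulty of the function true would be $2M$, where $M$ is the number
of input bits, because each of the bits is unnecessary and insufficient for the function.
We can correct for irrelevant input bits by subtracting 2 for every bit that has no
effect on the value of $f$, yielding

$$D(f) = \sum_{j<M} \left( \min\left(\frac{1}{\mathrm{LS}(i_j, f)}, \frac{1}{\mathrm{LS}(\neg i_j, f)}\right) + \min\bigl(\mathrm{LS}(i_j, f), \mathrm{LS}(\neg i_j, f)\bigr) \right) - 2C ,$$

where $C$ is the number of input bits that have no effect on the value of $f$.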

The definition uses LS and LN rather than S and N, because LS and LN have
well-understood ranges, with values of 1 indicating lack of necessity and sufficiency.
Because S and N are monotonic in LS and LN, distinctions that are apparent
when using LS and LN, which are what D measures, will also be apparent
when using S and N. When the input bits all have an effect on the value of $f$, but
are completely unnecessary and insufficient for $f$, its difficulty will be $2M$.

The  values  of  LS  and  LN  depend  on  being  able  to  evaluate  the  probability  of  a 
particular  input  vector  arriving;  thus,  this  measure  assumes  that  there  is  some  fixed 
distribution  on  the  input  vectors.  If  there  is  no  such  fixed  distribution  (as  we  have 
argued  may  not  be  the  case  in  many  embedded  learning  scenarios),  the  difficulty 
could  be  defined  to  be  the  supremum  over  all  possible  distributions. 
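The difficulty measure can be computed directly for small functions. The following is a minimal sketch, assuming uniformly distributed inputs and assuming that D(f) sums, over the input bits, min(1/LS(i, f), 1/LS(not i, f)) + min(LS(i, f), LS(not i, f)) and then subtracts 2 for each irrelevant bit. The function names, and the convention that LS is 1 when f is constant, are illustrative assumptions rather than part of the original algorithm.

```python
from itertools import product
from math import inf


def ls(atom, f, inputs):
    """LS(atom, f) = P(atom | f) / P(atom | not f) for uniformly distributed inputs."""
    pos = [x for x in inputs if f(x)]
    neg = [x for x in inputs if not f(x)]
    if not pos or not neg:
        return 1.0  # assumption: a constant f makes every atom uninformative
    p_pos = sum(1 for x in pos if atom(x)) / len(pos)
    p_neg = sum(1 for x in neg if atom(x)) / len(neg)
    if p_neg == 0.0:
        return inf if p_pos > 0.0 else 1.0
    return p_pos / p_neg


def inverse(x):
    """1/x extended to 0 and infinity."""
    if x == 0.0:
        return inf
    if x == inf:
        return 0.0
    return 1.0 / x


def difficulty(f, m):
    """Difficulty of Boolean function f over m input bits (assumed form, see above)."""
    inputs = list(product((False, True), repeat=m))
    total, irrelevant = 0.0, 0
    for j in range(m):
        ls_pos = ls(lambda x: x[j], f, inputs)
        ls_neg = ls(lambda x: not x[j], f, inputs)
        total += min(inverse(ls_pos), inverse(ls_neg)) + min(ls_pos, ls_neg)
        # Bit j is irrelevant if flipping it never changes the value of f.
        flipped = lambda x: x[:j] + (not x[j],) + x[j + 1:]
        if all(f(x) == f(flipped(x)) for x in inputs):
            irrelevant += 1
    return total - 2 * irrelevant


if __name__ == "__main__":
    xor = lambda x: x[0] != x[1]
    conj = lambda x: x[0] and x[1]
    print(round(difficulty(conj, 3), 2), round(difficulty(xor, 3), 2))
```

Under these assumptions, the two-bit conjunction comes out far easier than the two-bit exclusive-or, which matches the intuition that no single bit is informative about an exclusive-or.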

This  difficulty  measure  can  be  illustrated  by  considering  the  space  of  possible 
Boolean  functions  on  three  input  bits,  in  which  the  individual  input  vectors  are 
assumed  to  be  uniformly  distributed.  Following  Schlimmer  [62],  the  set  of  3-input 
Boolean  functions  can  be  divided  into  19  classes,  which  are  equivalence  classes  under 
permutation  and  negation  of  the  input  bits.  Table  7  uses  Schlimmer’s  numbering 
system,  giving  a  representative  function  from  each  class  and  its  D  measure.  The 
classes, going from easy to difficult, are ordered as follows:

{0, 4e, 8}, {2c, 6c}, {1, 7}, {4d}, {3b, 5b}, {4c}, {3a, 5a}, {4a}, {2b, 6b}, {4b}, {2a, 6a} .

Interestingly,  all  functions  with  difficulty  less  than  3  are  linearly  separable  and  those 
with  difficulty  greater  than  3  are  not.  Also,  D  seems  to  measure  the  difficulty  of 
problems  for  STAGGER  more  accurately,  in  many  cases,  than  the  measure  used  by 
Schlimmer.12 

12Schlimmer used a measure of the dependence of the concept on the input bits which is based on
Fisher's [23] work on category utility.
Class  f                                                                       D(f)

0.     false                                                                   0.00
1.     $a \wedge b \wedge c$                                                   1.28
2a.    $(a \wedge b \wedge c) \vee (\neg a \wedge \neg b \wedge \neg c)$       6.00
2b.    $a \wedge ((b \wedge c) \vee (\neg b \wedge \neg c))$                   4.33
2c.    $a \wedge b$                                                            0.67
3a.    $(a \wedge b) \vee (\neg a \wedge \neg b \wedge \neg c)$                3.47
3b.    $a \wedge (b \vee c)$                                                   2.50
4a.    $(a \wedge \neg b) \vee (\neg a \wedge b)$                              4.00
4b.    $(a \wedge (b \vee c)) \vee (\neg a \wedge \neg b \wedge \neg c)$       4.67
4c.    $(a \wedge c) \vee (\neg b \wedge \neg c)$                              3.33
4d.    $(a \wedge b) \vee (b \wedge c) \vee (c \wedge a)$                      2.00
4e.    $a$                                                                     0.00
5a.    $(a \vee b) \wedge (\neg a \vee \neg b \vee \neg c)$                    3.47
5b.    $a \vee (b \wedge c)$                                                   2.50
6a.    $(a \vee b \vee c) \wedge (\neg a \vee \neg b \vee \neg c)$             6.00
6b.    $a \vee ((b \vee c) \wedge (\neg b \vee \neg c))$                       4.33
6c.    $a \vee b$                                                              0.67
7.     $a \vee b \vee c$                                                       1.28
8.     true                                                                    0.00

Table  7:  Difficulties  of  classes  of  3-input  Boolean  functions. 
7.8.3 Promotion Age

The  choice  of  values  for  the  age  parameter  depends  on  how  long  it  takes  for  the 
N ,  and  S  statistics  to  come  to  be  a  good  indication  of  the  values  they  are 
estimating.  If  reinforcement  has  a  high  variance,  for  instance,  it  may  take  more 
examples  to  get  a  true  statistical  picture  of  the  underlying  processes.  If  the  value  of 
R  is  large,  causing  new  combinations  to  be  made  infrequently,  it  is  often  important 
for  promotion  age  to  be  large,  ensuring  that  the  data  that  guides  the  combinations 
is  accurate.  If  R  is  small,  the  effect  of  occasional  bad  combinations  is  not  so  great 
and  may  be  outweighed  by  the  advantage  of  moving  candidate  hypotheses  more 
quickly  to  the  working  hypothesis  list. 


7.8.4  Rate  of  Generating  Hypotheses 

The  more  frequently  new  hypotheses  are  generated,  the  sooner  the  algorithm  will 
construct  important  subexpressions  and  the  more  closely  it  will  track  a  changing 
environment.  However,  each  new  hypothesis  that  has  a  promising  value  of  er  will 
be  executed  a  number  of  times  to  see  if  its  value  of  erp  is  as  high  as  that  of  the 
current  best  hypothesis.  In  general,  most  of  these  hypotheses  will  not  be  as  good  as 
the  best  existing  one,  so  using  them  to  choose  actions  will  decrease  the  algorithm’s 
overall  performance  significantly. 


7.8.5  Maximum  New  Hypothesis  Tries 

The  attempt  to  make  a  new  hypothesis  can  fail  for  two  reasons.  Either  the  newly- 
created  hypothesis  already  exists  in  the  working  or  candidate  hypothesis  list  of  the 
level  for  which  it  was  created  or  the  expression  associated  with  the  hypothesis  was 
subject  to  one  of  the  reductions  of  Appendix  B,  causing  it  to  be  inappropriate  for 
this  level.  It  is  possible,  but  very  unlikely,  to  have  more  than  H  +  C  failures  of  the 
first  type.  The  number  of  failures  of  the  second  type  is  harder  to  quantify. 
7.9 Empirical Results

This  section  describes  a  set  of  experiments  with  the  GTRL  algorithm.  First,  the 
operation  of  the  GTRL  algorithm  is  illustrated  by  discussing  a  sample  run.  Then, 
the  dependence  of  the  algorithm’s  performance  on  the  settings  of  its  parameters 
is  explored.  Finally,  the  performance  of  the  GTRL  algorithm  is  compared  with  the 
algorithms  of  the  previous  chapter  on  Tasks  5,  6,  and  7. 

7.9.1  Sample  Run 

Figure 49 shows the trace of a sample run of the GTRL algorithm. It is executed
on Task 8, a binomial Boolean-expression world13 with 3 input bits, in which the
expression is $(b_0 \wedge b_1) \vee (b_1 \wedge b_2)$, $p_{1s} = .9$, $p_{1n} = .1$, $p_{0s} = .1$, and $p_{0n} = .9$. The figure
shows the state of the algorithm at ticks 50, 100, and 250. The report for each tick
shows the working hypotheses for each level, together with their statistics.14 In
order to save space in the figure, only the four most predictive working hypotheses
are shown at each level. At tick 50, the two component hypotheses, $b_0 \wedge b_1$ and $b_1 \wedge b_2$,
have been constructed. They both have high levels of sufficiency, which makes them
good operands for disjunction. By tick 100, the correct disjunction has been made,
and the most predictive hypothesis is the optimal hypothesis $(b_0 \wedge b_1) \vee (b_1 \wedge b_2)$.
At tick 250, the optimal hypothesis is still winning and the average reinforcement
is approaching optimal.

7.9.2  Effects  of  Parameter  Settings  on  Performance 

This section describes a set of experiments that illustrate how learning performance
varies as a function of the values of the parameters PA, R, and H on Task 8,
which was described in the previous section. The parameter L was set to 3, $z_{\alpha/2}$
to 2, C to be equal to H, and T to 100. Figures 50, 51, and 52 show the results,

13Binomial  Boolean-expression  worlds  are  defined  in  Section  6.5.1. 

14The  age  statistic  reported  in  the  trace  is  the  number  of  times  the  hypothesis  has  been  chosen 
to  generate  actions,  rather  than  the  value  of  age,  which  is  the  number  of  times  this  hypothesis  has 
agreed  with  the  ones  that  have  been  chosen  to  generate  actions. 
****** Tick 50 Summary
----- Level 0 -----
PV = 850.9243  ERPUB = 0.92  ER = 0.85  N = 0.87  S = 0.84  AGE = 14  H: 1
PV = 834.0000  ERPUB = 1.00  ER = 0.83  N = 0.69  S = 0.94  AGE = 0   H: 2
PV = 770.0000  ERPUB = 1.00  ER = 0.77  N = 0.75  S = 0.78  AGE = 0   H: 0
PV = 751.0000  ERPUB = 1.00  ER = 0.75  N = 0.75  S = ****  AGE = 0   H: f
----- Level 1 -----
PV = 904.9776  ERPUB = 0.98  ER = 0.90  N = 0.85  S = 1.00  AGE = 8   H: (and 1 2)
PV = 894.9699  ERPUB = 0.97  ER = 0.89  N = 0.80  S = 1.00  AGE = 6   H: (and 0 1)
PV = 882.8500  ERPUB = 0.85  ER = 0.88  N = 1.00  S = 0.87  AGE = 4   H: (or 0 2)
PV = 847.0000  ERPUB = 1.00  ER = 0.85  N = 0.87  S = 0.90  AGE = 0   H: (or 1 (not 0))
----- Level 2 -----
PV = 866.9055  ERPUB = 0.91  ER = 0.87  N = 0.75  S = 0.91  AGE = 2   H: (or (and 1 2) (or 1 2))
PV = 819.0000  ERPUB = 1.00  ER = 0.82  N = 1.00  S = 0.78  AGE = 0   H: (or 0 (and 1 2))
PV = 728.0000  ERPUB = 1.00  ER = 0.73  N =       S = 0.73  AGE = 0   H: (or 0 (or 1 2))
*** Reinf = ( 37 / 50) 74.00%  Long term = ( 37 / 50) 74.00% ***

****** Tick 100 Summary
----- Level 0 -----
PV = 898.9243  ERPUB = 0.92  ER = 0.90  N = 0.90  S = 0.90  AGE = 14  H: 1
PV = 876.0000  ERPUB = 1.00  ER = 0.87  N = 0.81  S = 0.94  AGE = 0   H: 2
PV = 850.0000  ERPUB = 1.00  ER = 0.85  N = 0.85  S = ****  AGE = 0   H: f
PV = 844.0000  ERPUB = 1.00  ER = 0.84  N = 0.88  S = 0.81  AGE = 0   H: 0
----- Level 1 -----
PV = 931.9699  ERPUB = 0.97  ER = 0.93  N = 0.90  S = 1.00  AGE = 6   H: (and 0 1)
PV = 927.9801  ERPUB = 0.98  ER = 0.93  N = 0.89  S = 1.00  AGE = 9   H: (and 1 2)
PV = 914.0000  ERPUB = 1.00  ER = 0.91  N = 0.91  S = 0.93  AGE = 3   H: (and 2 (not 0))
PV = 911.8500  ERPUB = 0.85  ER = 0.91  N = 1.00  S = 0.88  AGE = 4   H: (or 0 2)
----- Level 2 -----
PV = 962.9706  ERPUB = 0.97  ER = 0.96  N = 0.94  S = 1.00  AGE = 19  H: (or (and 0 1) (and 1 2))
PV = 947.9055  ERPUB = 0.91  ER = 0.95  N = 0.96  S = 0.92  AGE = 2   H: (and (or 0 2) (or 1 (not 2)))
PV = 945.7935  ERPUB = 0.79  ER = 0.95  N = 0.96  S = 0.92  AGE = 1   H: (or (and 0 1) (and 2 (not 0)))
PV = 940.0000  ERPUB = 1.00  ER = 0.94  N = 0.94  S = 0.94  AGE = 0   H: (and (or 0 1) (or 0 2))
*** Reinf = ( 45 / 50) 90.00%  Long term = ( 82 / 100) 82.00% ***

****** Tick 250 Summary
----- Level 0 -----
PV = 925.9243  ERPUB = 0.92  ER = 0.93  N = 0.93  S = 0.92  AGE = 14  H: 1
PV = 891.0000  ERPUB = 1.00  ER = 0.89  N = 0.89  S = 0.89  AGE = 0   H: 0
PV = 886.0000  ERPUB = 1.00  ER = 0.89  N = 0.95  S = 0.80  AGE = 0   H: (not 2)
PV = 886.0000  ERPUB = 1.00  ER = 0.89  N = 0.89  S = ****  AGE = 0   H: f
----- Level 1 -----
PV = 927.0000  ERPUB = 1.00  ER = 0.93  N = 0.96  S = 0.91  AGE = 0   H: (or 1 (not 2))
PV = 922.9699  ERPUB = 0.97  ER = 0.92  N = 0.91  S = 0.95  AGE = 8   H: (and 0 1)
PV = 921.0000  ERPUB = 1.00  ER = 0.92  N = 0.92  S = 0.92  AGE = 0   H: (or 1 (not 0))
PV = 917.0000  ERPUB = 1.00  ER = 0.92  N = 0.95  S = 0.90  AGE = 0   H: (or 0 1)
----- Level 2 -----
PV = 931.9491  ERPUB = 0.95  ER = 0.93  N = 0.92  S = 0.95  AGE = 166 H: (or (and 0 1) (and 1 2))
PV = 928.9055  ERPUB = 0.91  ER = 0.93  N = 0.93  S = 0.93  AGE = 2   H: (and (or 0 2) (or 1 (not 2)))
PV = 921.0000  ERPUB = 1.00  ER = 0.92  N = 0.90  S = 0.94  AGE = 0   H: (and (or 0 1) (or 0 2))
PV = 916.8500  ERPUB = 0.85  ER = 0.92  N = 0.91  S = 0.92  AGE = 4   H: (or (and 0 1) (and 2 (not 0)))
*** Reinf = ( 46 / 50) 92.00%  Long term = ( 219 / 250) 87.60% ***


Figure 49: A sample run of the GTRL algorithm.
Figure 50: Performance versus parameter value PA for Task 8.


plotting  average  reinforcement  per  tick  on  100  runs  of  length  3000  against  each  of 
the  remaining  parameters,  PA,  R,  and  H. 


The expected reinforcement is maximized at a low value of PA, the promotion
age of candidate hypotheses, because it is relatively easy to discriminate between
good and bad actions in Task 8. When the probabilities of receiving reinforcement
value 1 are closer to one another, as they are in the tasks discussed in the next
section, it becomes necessary to use higher values of PA. Because this task (and all
of the others discussed in this chapter) is stationary, the only reason to have a low
value of R, the inverse of the rate at which new hypotheses are generated, is if the
function is very difficult and the hypothesis list is too small to hold all subexpressions
at once. This is not the case for Task 8, so high values of R are desirable. Finally,
performance increases with the length of the hypothesis lists, H, in every task.
Because this task is relatively easy, however, the correct answer is usually found
fairly quickly with even small values of H, so the increase is not dramatic (this is
evidenced by the small range of er in Figure 52).
Figure 51: Performance versus parameter value R for Task 8.


Figure  52:  Performance  versus  parameter  value  H  for  Task  8. 
Task   PA    R     H     Results

5      35    200   30    .5648
6      10    100   30    .7879
7      45    110   20    .7877

Table  8:  Best  parameter  values  for  GTRL  on  Tasks  5,  6,  and  7  from  Chapter  6. 

7.9.3  Comparison  with  Other  Algorithms 

The GTRL algorithm was tested on Tasks 5, 6, and 7 from Chapter 6. The best values
of the parameters for each task were determined through extensive testing, and are
shown in Table 8. Some of the values are arbitrarily cut off where the parameter
testing stopped. For instance, performance on Task 5 might be improved with higher
values of PA and performance on Task 6 would be improved with higher values of H.
The average reinforcement per tick of executing GTRL at these parameter settings
on 100 runs of length 3000 is shown in the final column of the table.

Figure 53 is a modified version of Figure 42, with the results of the GTRL
algorithm included with those of the algorithms of Chapter 6 for Tasks 5, 6, and 7.
On Tasks 5 and 6, the GTRL algorithm performs significantly better than the LARC,
LARC+, and BP algorithms, but not as well as IE, IEKDNF, or LARCKDNF. Finally,
on Task 7, the real advantage of GTRL is illustrated. On a task with a large number
of inputs, GTRL works efficiently and is significantly outperformed only by IEKDNF.

The  learning  curves  of  GTRL  on  each  of  the  tasks  are  shown  in  Figures  54,  55 
and 56. They are superimposed on the learning curves of the algorithms tested in
Chapter  6;  the  GTRL  curves  are  drawn  in  bold  lines. 

This comparison is, to some degree, unfair, because the GTRL algorithm is
designed for nonstationary environments. We can see in the learning curves that,
although it improves quickly early in a run, it does not reach as high a steady-state
level of performance as the other algorithms. It does not converge to a fixed state,
because it is always entertaining new competing hypotheses. This flexibility causes
a large decrease in performance. If the GTRL algorithm is to be applied in a domain
in which changes, if any, are expected to take place near the beginning of a run,

Figure  53:  Significance  of  GTRL  results  on  Tasks  5,  6,  and  7,  compared  with  the 
results  of  the  algorithms  of  Chapter  6. 
Figure 54: GTRL learning curve for Task 5 (bold) compared with the algorithms of
Chapter  6. 



Figure  55:  GTRL  learning  curve  for  Task  6  (bold)  compared  with  the  algorithms  of 
Chapter  6. 
Figure 56: GTRL learning curve for Task 7 (bold) compared with the algorithms of
Chapter  6. 

performance can be improved by decreasing over time the rate at which new
candidate hypotheses are generated. This will cause the algorithm to spend less time
experimenting and more time acting on the basis of known good hypotheses.
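One way to picture a decreasing generation rate is a schedule in which the interval between new candidate hypotheses grows after each generation. The sketch below is purely illustrative: the geometric growth rule, the function name, and the parameter values are assumptions, not a schedule taken from the experiments reported here.

```python
def generation_ticks(total_ticks, r0, growth):
    """Ticks at which a new candidate hypothesis would be generated.

    r0 is the initial interval between generations (the parameter R);
    after each generation the interval is multiplied by `growth`.
    With growth == 1.0 this reduces to a fixed interval of r0 ticks.
    """
    ticks = []
    interval = float(r0)
    t = interval
    while t <= total_ticks:
        ticks.append(int(t))
        interval *= growth
        t += interval
    return ticks


if __name__ == "__main__":
    fixed = generation_ticks(3000, 100, 1.0)
    slowing = generation_ticks(3000, 100, 1.5)
    print(len(fixed), len(slowing))
```

With a growth factor above 1, most generation events fall early in the run, so later ticks are spent acting on established hypotheses rather than testing new ones.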


7.10  Conclusions  and  Extensions 


We have seen that the GTRL algorithm can be used to learn a variety of Boolean
function classes with varying degrees of effectiveness and efficiency. This chapter
describes only a particular instance of a general, dynamic generate-and-test
method; there are a number of other possible variations.

The algorithm is designed so that other search heuristics may be easily accommodated.
An example of another, potentially useful, heuristic is to combine hypotheses
that are highly correlated with the optimal hypothesis. One way to implement this
heuristic would be to run a linear-association algorithm, such as LARC, over the
input bits and the outputs of the newly-created hypotheses, then make combinations
of those hypotheses that evolve large weights. It is not immediately apparent how
this would compare to using the N and S heuristics.




Another possible extension would be to add genetically-motivated operators,
such as crossover and mutation, to the set of search operators. Many genetic methods
are concerned only with the performance of the final result, so this extension
would have to be made carefully in order to preserve good on-line performance.


Chapter  8 


Learning  Action  Maps  with  State 


All of the algorithms that we have considered thus far are capable of learning only
action maps that are pure, instantaneous functions of their inputs. It is more
generally  the  case,  however,  that  an  agent’s  actions  must  depend  on  the  past  history 
of  input  values  in  order  to  be  effective.  By  storing  information  about  past  inputs, 
the  agent  is  able  to  induce  a  finer  partition  on  the  set  of  world  states,  allowing  it  to 
make  more  discriminations  and  to  tailor  its  actions  more  appropriately  to  the  state 
of  the  world. 

Perhaps  the  simplest  way  to  achieve  this  finer-grained  historical  view  of  the  world 
is  to  simply  remember  all  input  instances  from  the  last  k  ticks  and  present  them  in 
parallel  to  the  behavior-learning  algorithm.  This  method  has  two  drawbacks:  it  is 
not  possible  for  actions  to  depend  on  conditions  that  reach  back  arbitrarily  far  in 
history  and  the  algorithmic  complexity  increases  considerably  as  the  length  of  the 
available  history  is  increased. 
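The fixed-window approach can be sketched as a small wrapper that concatenates the last k input vectors into one longer vector for the learner. This is only an illustration: the class name and the choice to pad with zeros before k inputs have arrived are assumptions.

```python
from collections import deque


class HistoryWindow:
    """Presents the last k input vectors, oldest first, as one flat vector."""

    def __init__(self, k, num_bits):
        self.k = k
        self.num_bits = num_bits
        # Assumption: pad with all-zero vectors until k real inputs have been seen.
        self.buffer = deque([(0,) * num_bits] * k, maxlen=k)

    def push(self, input_vector):
        """Record the newest input and return the k * num_bits augmented input."""
        if len(input_vector) != self.num_bits:
            raise ValueError("input vector has the wrong length")
        self.buffer.append(tuple(input_vector))
        return tuple(bit for vec in self.buffer for bit in vec)


if __name__ == "__main__":
    window = HistoryWindow(k=3, num_bits=2)
    for vec in [(1, 0), (0, 1), (1, 1), (0, 0)]:
        print(window.push(vec))
```

Both drawbacks are visible here: the learner sees k times as many input bits, and anything older than k ticks is simply forgotten.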

This  chapter  will  present  an  alternative  approach,  based  on  the  GTRL  algorithm, 
which  can  efficiently  learn  simple  action  maps  with  temporal  dependencies  that  go 
arbitrarily  far  back  in  history. 
Figure 57: Timing diagram for a set-reset flip-flop.

8.1  Set-Reset 

A  common  component  in  hardware  logic  design  is  a  set-reset  (SR)  flip-flop.1  It  has 
two input lines, designated set and reset, a clock, and an output line. Whenever the
clock  is  triggered,  if  the  set  line  is  high,  then  the  output  of  the  unit  is  high;  else,  if 
the  reset  line  is  high,  the  output  of  the  unit  is  low;  finally,  if  both  input  lines  are 
low,  the  output  of  the  unit  remains  the  same  as  it  was  during  the  previous  clock 
cycle.  The  value  of  the  output  is  held  in  the  determined  state  until  the  next  clock 
tick. 

The behavior of an SR flip-flop can be described logically in terms of the following
binary Boolean operator

$$\mathrm{SR}(a, b) = a \vee (\neg b \wedge \bullet \mathrm{SR}(a, b)) ,$$

where $\bullet$ is the temporal operator "last." Figure 57 shows a timing diagram, in
which the top two lines represent a time-history of the values of wires a and b and
the bottom line represents the time history of the values of $\mathrm{SR}(a, b)$, the output of
a set-reset flip-flop whose inputs are wires a and b.

1 Components of this kind are also commonly referred to as RS (reset-set) flip-flops in the
logic-design literature.

In the logical definition of SR as a Boolean operator, no initial value is specified.
This problem is dealt with by adding a third logical value, $\bot$, which means,
intuitively, "undefined." When an expression of the form $\mathrm{SR}(a, b)$ is to be evaluated for
the first time, it is assumed that the value of $\bullet \mathrm{SR}(a, b)$ is $\bot$. The value $\bot$ combines
with the other logical values as follows:

$$\begin{aligned}
\mathit{true} \vee \bot &= \mathit{true} \\
\mathit{false} \vee \bot &= \bot \\
\bot \vee \bot &= \bot \\
\mathit{true} \wedge \bot &= \bot \\
\mathit{false} \wedge \bot &= \mathit{false} \\
\bot \wedge \bot &= \bot \\
\neg \bot &= \bot
\end{aligned}$$

Thus, the expression $\mathrm{SR}(a, b)$ will have value $\bot$ until either $a = \mathit{true}$, in which
case $\mathrm{SR}(a, b) = \mathit{true} \vee \ldots = \mathit{true}$, or $a = \mathit{false}$ and $b = \mathit{true}$, in which case
$\mathrm{SR}(a, b) = \mathit{false} \vee (\mathit{false} \wedge \bot) = \mathit{false}$.
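The three-valued semantics can be checked with a short sketch. It represents the undefined value by Python's None and evaluates SR(a, b) = a or (not b and last SR(a, b)) one tick at a time. The helper names are illustrative assumptions, not identifiers from the GTRL implementation.

```python
UNDEFINED = None  # stands for the third logical value, "undefined"


def k_not(x):
    return UNDEFINED if x is UNDEFINED else (not x)


def k_or(x, y):
    if x is True or y is True:
        return True
    if x is UNDEFINED or y is UNDEFINED:
        return UNDEFINED
    return False


def k_and(x, y):
    if x is False or y is False:
        return False
    if x is UNDEFINED or y is UNDEFINED:
        return UNDEFINED
    return True


class SRUnit:
    """A set-reset component whose stored value starts out undefined."""

    def __init__(self):
        self.state = UNDEFINED

    def step(self, a, b):
        # SR(a, b) = a OR (NOT b AND previous value of SR(a, b))
        self.state = k_or(a, k_and(k_not(b), self.state))
        return self.state


if __name__ == "__main__":
    unit = SRUnit()
    for a, b in [(False, False), (False, True), (False, False), (True, True)]:
        print(a, b, unit.step(a, b))
```

The unit stays undefined while both inputs are low, becomes false on the first reset, holds that value, and is driven true whenever set is high, even if reset is high on the same tick.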


8.2  Using  SR  in  GTRL 

In  the  original  version  of  the  GTRL  algorithm,  the  hypotheses  were  pure  Boolean 
functions  of  the  input  bits.  This  section  describes  an  extended  version  of  that 
algorithm,  called  GTRL-S,  which  has  simple  sequential  networks  as  hypotheses. 

8.2.1  Hypotheses 

The  GTRL-S  algorithm  is  structured  in  exactly  the  same  way  as  the  GTRL  algorithm. 
The  main  difference  is  that  SR  is  added  as  another  binary  hypothesis-combination 
operator.  This  allows  hypotheses  such  as 

$$\mathrm{SR}(\neg b_0, b_1 \wedge b_2) \wedge (b_1 \vee \mathrm{SR}(\mathrm{SR}(b_0, b_1), \neg b_2)) ,$$

which  represents  the  sequential  network  shown  in  Figure  58,  to  be  constructed. 

Figure 58: A sample sequential network, described by
$\mathrm{SR}(\neg b_0, b_1 \wedge b_2) \wedge (b_1 \vee \mathrm{SR}(\mathrm{SR}(b_0, b_1), \neg b_2))$.

This operator does not allow every possible sequential circuit to be generated,
however. In the pure-function case it was not necessary to have a negation operator
because De Morgan's laws guarantee that having access to the negated atoms is
sufficient to generate any Boolean function. Unfortunately, negation cannot be
moved past the SR operator in any general way, so, for instance, a sequential circuit
equivalent to $\neg \mathrm{SR}(i_0, i_1)$ cannot be generated by applications of the SR operator
to atoms and their negations. This deficiency can be simply remedied by adding a
unary negation operator or by adding an operator NSR, which is defined as

$$\mathrm{NSR}(a, b) = \neg \mathrm{SR}(a, b) .$$

Figure 59: This circuit generates the sequence 0, 1, 0, 1, ...; because it has feedback,
it cannot be constructed by the GTRL-S algorithm.

Another deficiency is that the construction of sequential networks with feedback
is not allowed. Thus, the circuit shown in Figure 59, which generates the sequence
0, 1, 0, 1, ..., cannot be constructed. For agents embedded in realistic environments,
this limitation may not be too great in practice. We would not, in general, expect
such agents to have to make state changes that are not a function of changes in the
world that are reflected in the agent's input vector. There is one additional limitation
that is both more serious and more easily corrected. With the semantics of SR
defined as they are, it is not possible to construct an expression equivalent to $\bullet a$.
One way to solve this problem would be to redefine $\mathrm{SR}(a, b)$ as $\bullet a \vee (\bullet \neg b \wedge \bullet \mathrm{SR}(a, b))$.
In that case, $\bullet a$ could be expressed as $\mathrm{SR}(a, \neg a)$, but the search heuristics to be
used in GTRL-S (described in Section 8.2.3) would no longer be applicable. Another
option would be to add $\bullet$ as a unary operator, along with negation. This is a
reasonable course of action; it is not followed in this chapter, however, both because it
would complicate the exposition and because no appropriate search heuristics for
the last and negation operators are known.

In addition to the syntactic expression describing the network and the necessary
statistics (discussed in Section 7.3), a hypothesis also contains the state of each
of its SR components. When a new hypothesis is created with SR as the top-level
operator, that component's state is set to $\bot$. The state of SR components occurring
in the operands is copied from the operand hypotheses. In order to keep all state
values up to date, a new state-update phase is added to the update function. In
the state-update phase, the new state of each SR component of each hypothesis is
calculated as a function of the input vector and the old state, then stored back into
the hypothesis. The result of this calculation may be 1, 0, or $\bot$.

Expressions containing SR operators may be partially simplified using an
extension of the simplification procedure used for standard Boolean expressions. This
extended simplifier is also described in Appendix B.


8.2.2  Statistics 

The statistical modules for GTRL-S differ from GTRL only when satisfies(i, h) returns
the value $\bot$. In that case, none of the statistics is updated. Once satisfies(i, h)
becomes defined for any input i, it will remain defined for every input, so this has
no effect on the distribution of the instances for which statistics are collected, just
on when the collection of statistics begins.
8.2.3 Search Heuristics

The  problem  of  guiding  the  search  for  generating  sequential  networks  is  considerably 
more  difficult  than  for  pure  functional  networks.  Statistics  collected  about  the 
performance  of  expressions  as  generators  of  actions  in  the  world  are  not  necessarily  a 
strong  indication  of  their  performance  as  the  set  or  reset  signal  of  an  SR  component. 
They  can  still  provide  some  guidance,  however. 

Recall the logical definition of SR as

$$\mathrm{SR}(a, b) = a \vee (\neg b \wedge \bullet \mathrm{SR}(a, b)) .$$

First, we can see that $a \rightarrow \mathrm{SR}(a, b)$ and that $\mathrm{SR}(a, b) \rightarrow (a \vee \neg b)$. The first
observation should guide us to choose set operands that are sufficient for the target
hypothesis. The second observation is slightly more complex, due to the fact that
set takes precedence over reset, but it makes it reasonable to choose reset operands
whose negations are necessary for the target hypothesis. From these observations
we can derive the following heuristic:

When making a set-reset hypothesis, use a set operand that has a high
value of sufficiency and a reset operand whose negation has a high value
of necessity.
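The heuristic can be illustrated with a small selection routine. This is a hedged sketch only: the Stats record, the string form of the expressions, and the numbers in the usage example are hypothetical, and the real algorithm's operand choice may involve more than a simple maximum over one list.

```python
from typing import NamedTuple


class Stats(NamedTuple):
    expr: str   # printable form of a working hypothesis
    n: float    # estimated necessity
    s: float    # estimated sufficiency


def negate(expr):
    """Negate an expression, cancelling a double negation."""
    if expr.startswith("(not ") and expr.endswith(")"):
        return expr[5:-1]
    return f"(not {expr})"


def propose_sr(working):
    """Pick operands for a new set-reset hypothesis from the working list.

    Set operand: the hypothesis with the highest sufficiency.
    Reset operand: the negation of the hypothesis with the highest necessity,
    so that the negation of the reset operand has high necessity.
    """
    set_op = max(working, key=lambda h: h.s)
    most_necessary = max(working, key=lambda h: h.n)
    return f"(sr {set_op.expr} {negate(most_necessary.expr)})"


if __name__ == "__main__":
    # Made-up statistics for three atomic hypotheses.
    working = [Stats("1", 0.50, 0.95), Stats("(not 0)", 0.96, 0.50), Stats("0", 0.40, 0.60)]
    print(propose_sr(working))
```

In the usage example the most sufficient atom is bit 1 and the most necessary is the negation of bit 0, so the proposed hypothesis sets on bit 1 and resets on bit 0.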


8.2.4  Complexity 

The computational complexity of the GTRL-S algorithm is the same as that of GTRL,
which is discussed in Section 7.7. The only additional work performed by GTRL-S
is the state-update computation. It has complexity $O(L(H + C)2^L)$ (assuming that
H and C are independent of level), which is of the same order as the statistical
updating phase that occurs in both algorithms.


8.3  Experiments  with  GTRL-S 

This section documents experiments with GTRL-S in some simple domains that
require action mappings with state. There are no direct comparisons with other
algorithms because no other comparable algorithms that learn action mappings
with state from reinforcement are known.


8.3.1  Lights  and  Buttons 

The first domain of experimentation is very simple. It can be thought of as
consisting of two light bulbs and two buttons. The input to the agent is a vector of
two bits, the first having value 1 if the first light bulb is on and the second having
value 1 if the second light bulb is on. The agent can generate two actions: action 0
causes the first button to be pressed and action 1 causes the second button to be
pressed. One or no lights will be on at each instant. The optimal action map is to
push the button corresponding to the light that is on if, in fact, a light is on. If no
lights are on, the optimal action is to push the button associated with the light that
was last on. A light is turned on on a given tick with probability p_l; the particular
light is chosen randomly and equiprobably. Thus, the optimal hypothesis is simply
SR(b_1, b_0).
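The domain and the set-reset action map described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the use of `None` for the not-yet-determined stored value, and the rule that the set operand takes priority are assumptions made here, not details taken from the GTRL-S implementation.

```python
import random

def sr_latch(state, set_bit, reset_bit):
    """One tick of a set-reset unit (assumed semantics: set wins, then reset, else hold)."""
    if set_bit:
        return 1
    if reset_bit:
        return 0
    return state  # neither operand is on: keep the stored value

def lights_step(rng, p_light):
    """Return the input vector (b0, b1) for one tick; at most one light is on."""
    if rng.random() < p_light:
        return (1, 0) if rng.random() < 0.5 else (0, 1)
    return (0, 0)

def run(ticks=1000, p_light=0.1, seed=0):
    """Count how often SR(b1, b0) presses the button of the light that was last on."""
    rng = random.Random(seed)
    state, last_on = None, None
    agree = total = 0
    for _ in range(ticks):
        b0, b1 = lights_step(rng, p_light)
        if b0:
            last_on = 0
        if b1:
            last_on = 1
        state = sr_latch(state, b1, b0)  # the hypothesis SR(b1, b0)
        if last_on is not None:          # before any light comes on, no action is "correct"
            total += 1
            agree += int(state == last_on)
    return agree, total
```

Calling `run()` returns a pair of equal counts, since the latch output always names the light that was most recently on.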

Figure 60 shows parts of the trace of a sample run of the GTRL-S algorithm in the
simple lights and buttons domain, in which the correct action (as discussed above)
yields reinforcement value 1 with probability .9 and the incorrect action yields
reinforcement value 1 with probability .1. A light comes on each tick with probability
.1. The first section of the trace shows the state of the algorithm after 100 ticks.
We can see that the correct hypothesis, SR(b_1, b_0) (footnote 2), has just been found
and appears to be the best. After 200 ticks, we can see two recently-created hypotheses
being tested. They are found wanting, however. By tick 500, the original winning
hypothesis is still near the top of the list, surpassed only by another equivalent
expression, SR(¬b_0, ¬b_1). The GTRL-S algorithm works quite reliably on this problem
because the search heuristics provide good guidance. In the statistics of the atomic
hypotheses at level 0, it is easy to see that b_1 is the most sufficient hypothesis and
¬b_0 is the most necessary.

Footnote 2: The third value in the SR expressions of the printout indicates the stored
value of the unit: t for 1, nil for 0, and bottom for ⊥ (which does not happen to occur
in this trace).

Figure 60: A sample run of the GTRL-S algorithm on the simple lights and buttons
problem

8.3.2 Many Lights and Buttons


The lights-and-buttons domain described in Section 8.3.1 can be easily extended
to  have  an  arbitrary  number,  M,  of  lights  and  buttons.  If  we  let  each  input  bit 
correspond  to  a  light  and  each  output  bit  correspond  to  the  pressing  of  a  button, 
we  have  an  environment  with  M  input  and  M  output  bits.  The  agent  is  never 
rewarded  for  pressing  more  than  one  button  at  once. 


The more complex lights-and-buttons problem can be solved by using the CASCADE
method in conjunction with GTRL-S, with one copy of the GTRL-S algorithm
for each bit of output (corresponding to each button). Figure 61 shows excerpts
from a sample run with two lights and two buttons (this differs from the domain
described in the previous section in that there are two output bits rather than only
one). The first two levels belong to the instance of GTRL-S for the first output
bit and the second two levels belong to the second instance of GTRL-S. After the
first 100 ticks, neither instance has found the correct hypothesis and the
performance is quite poor. By tick 200, however, the best hypothesis for the first bit
is SR(¬b_1, ¬b_0), which is equivalent to SR(b_0, b_1), the correct function. The best
hypothesis for the second bit is SR(b_1, b_0), which is also correct. Again, it is easy to
verify that the necessity and sufficiency heuristics are a good guide for the search.


The search heuristics for SR fail us when we wish to extend this problem to a
larger number of lights and buttons using a cascade of 3-level instances of GTRL-S.
When there are three lights and buttons, the optimal function for the first bit can
be most simply expressed as SR(b_0, b_1 ∨ b_2). In order to synthesize this expression,
the expression ¬b_1 ∧ ¬b_2 must be available at the previous level. For that to happen,
¬b_1 and ¬b_2 must be highly sufficient, which is false, in general. Thus, the only
way to learn this function is to generate all sub-expressions exhaustively, which is
computationally prohibitive.
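To make the three-light case concrete, the following sketch builds the reset operand for one button (the disjunction of all other lights) and checks by enumeration that its negation is exactly the conjunction of the negated atoms, which is the expression that would have to exist at the previous level. The helper names are hypothetical and chosen only for this illustration.

```python
from itertools import product

def reset_operand(bits, k):
    """Reset operand for button k: some light other than k is on."""
    return any(b for j, b in enumerate(bits) if j != k)

def negated_reset_as_conjunction(bits, k):
    """Conjunction of the negated atoms for every light other than k."""
    return all(not b for j, b in enumerate(bits) if j != k)

def negation_matches(num_lights, k):
    """True if NOT(reset operand) equals the conjunction on every input vector."""
    return all(
        (not reset_operand(bits, k)) == negated_reset_as_conjunction(bits, k)
        for bits in product((0, 1), repeat=num_lights)
    )
```

For three lights and the first button, `negation_matches(3, 0)` confirms that the negation of b_1 ∨ b_2 is ¬b_1 ∧ ¬b_2; the difficulty described above is that the search heuristics give no reason to build that conjunction.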
Figure 61: A sample run of the GTRL-S algorithm on the two-bit lights and buttons
problem. Only the 4 most predictive hypotheses are shown at each non-atomic
level.

8.4 Conclusion

Although  the  approach  embodied  in  GTRL-S  is  capable  of  learning  some  simple 
action  maps  with  state,  it  does  not  hold  much  promise  for  more  complex  cases.  In 
such  cases,  it  may,  in  fact,  be  necessary  to  learn  a  state-transition  model  of  the 
world  and  values  of  the  world  states,  using  a  combination  of  Rivest  and  Schapire’s 
[56]  method  for  learning  models  with  hidden  state  and  Sutton’s  [72]  or  Whitehead 
and  Ballard’s  [80]  method  for  “compiling”  transition  models  into  action  maps.  This 
will  be  a  difficult  job — currently  available  methods  for  learning  models  with  hidden 
state  only  work  in  deterministic  worlds.  Even  if  they  did  work  in  non-deterministic 
worlds,  they  attempt  to  model  every  aspect  of  the  world’s  state  transitions.  In 
realistic  environments,  there  will  be  many  more  aspects  of  the  world  state  than  the 
agent  can  track,  and  its  choice  of  which  world  states  to  represent  must  be  guided 
by  reinforcement,  so  that  it  can  learn  to  make  only  the  “important”  distinctions. 
Drescher’s  work  on  generating  “synthetic  items”  [18]  is  a  promising  step  in  this 
direction.  His  “schema  mechanism”  attempts  to  learn  models  of  the  world  that  will 
enable  problem  solving.  When  it  is  unsuccessful  at  discovering  which  preconditions 
will  cause  a  particular  action  to  have  a  particular  result,  it  “reifies”  that  set  of 
preconditions  as  an  “item”  and  attempts  to  discover  tests  for  its  truth  or  falsity.  In 
many  cases  the  reified  item  turns  out  to  be  a  particular  aspect  of  the  state  of  the 
world  that  is  hidden  from  the  agent. 


Chapter  9 

Delayed  Reinforcement 


Until  now,  we  have  only  considered  algorithms  for  learning  to  act  in  environments 
in  which  local  reinforcement  is  generated  each  tick,  giving  the  agent  all  of  the 
information  it  will  ever  get  about  the  success  or  failure  of  the  action  it  just  took. 
This  is  a  simple  instance  of  the  more  general  case,  in  which  actions  taken  at  a 
particular  time  may  not  be  rewarded  or  punished  until  some  time  in  the  future. 
This  chapter  surveys  some  existing  approaches  to  the  problem  of  learning  from 
delayed reinforcement, focusing on the use of temporal difference methods [71], such
as Sutton’s adaptive heuristic critic method [70] and Watkins’ Q-learning method
[78]. It will be shown how these methods can be combined with the pure
function-learning algorithms presented in previous chapters to create a variety of
systems that can learn from delayed reinforcement.


9.1  Q  Learning 

There are well-known dynamic programming methods, such as policy improvement
[57], that can be used for computing the optimal action mapping for an agent, given
a complete state-transition model of the world. Watkins has developed a method
for learning from delayed reinforcement that he describes [78] as “incremental
dynamic programming by a Monte Carlo method: the agent’s experience -- the
state-transitions and the rewards that the agent observes -- are used in place of
transition and reward models.”

Algorithm 16 (Q) The initial state s_0 is an array indexed by the set of input states
and the set of actions, whose elements are initialized to some constant value.

    u(s, i, a, r) =  s[i', a'] := (1 - α) s[i', a'] + α (r + γ U(i))
    e(s, i)       =  a such that s[i, a] is maximized

where i' and a' are the input and action values from tick t - 1, 0 < α < 1, 0 < γ < 1,
and U(i) = max_a {s[i, a]}.

Figure 62: The Q-learning algorithm.

Watkins’ method is referred to as Q-learning because it is concerned with learning
values of Q(i, a), where i is an input, a is an action, and Q(i, a) is the expected
discounted reward of taking action a in input state i and then continuing by following
the optimal policy. The agent’s policy is always to execute, in input state i, the
action a for which its estimate of Q(i, a) is maximized. The Q algorithm is described
formally in Figure 62.

The initial state of the Q algorithm is simply the array of estimated Q values,
indexed by the input and action sets. To evaluate an input instance, i, the action,
a, that maximizes Q(i, a) is generated. The update function adjusts the estimated
Q value of the previous input and action in the direction of

    r + γ U(i),

which is the actual reinforcement received, r, plus a discounted estimate of the
value of the next state, γ U(i). The function U(i) estimates the value of an input i
by returning the estimated Q value of the best action that can be taken from that
state. This update rule illustrates the concept of temporal difference learning, which
was formulated by Sutton [71]. Rather than waiting until a reinforcement value is
received and then propagating it back along the path of states that lead up to it,
each state is updated as it is encountered by using the discounted estimated value
of the next state as a component of the reinforcement. Initially, these estimated
values are meaningless, but as the agent experiences the world, they soon begin to
converge to the true values of the states.
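The update and evaluation functions of Figure 62 can be sketched as a small tabular learner. The class and method names below are assumptions made for illustration; the table is stored as a dictionary rather than an array, and ties in the evaluation function are broken in favor of the first action listed, a detail the text does not specify.

```python
from collections import defaultdict

class QLearner:
    """Tabular sketch of the Q algorithm; s[i, a] is kept in a dictionary."""

    def __init__(self, actions, alpha=0.5, gamma=0.9, initial=0.0):
        self.actions = list(actions)
        self.alpha = alpha                      # learning rate
        self.gamma = gamma                      # discount factor
        self.q = defaultdict(lambda: initial)   # estimated Q values
        self.prev = None                        # (i', a') from the previous tick

    def U(self, i):
        """Estimated value of input i: the best estimated Q value available there."""
        return max(self.q[(i, a)] for a in self.actions)

    def evaluate(self, i):
        """Greedy evaluation: the action whose estimated Q value is largest."""
        return max(self.actions, key=lambda a: self.q[(i, a)])

    def update(self, i, r):
        """Move s[i', a'] toward r + gamma * U(i)."""
        if self.prev is not None:
            target = r + self.gamma * self.U(i)
            self.q[self.prev] = (1 - self.alpha) * self.q[self.prev] + self.alpha * target

    def step(self, i, r):
        """One tick: update from the new input and reinforcement, then act."""
        self.update(i, r)
        a = self.evaluate(i)
        self.prev = (i, a)
        return a
```

A caller would invoke `step(i, r)` once per tick with the current input and the reinforcement just received, and execute the returned action.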

Watkins  does  not  specify  what  the  initial  estimated  Q  values  should  be.  If  the 
value  0  is  used  and  the  optimal  action  values  are  positive,  the  algorithm  will  almost 
certainly fail, because it always chooses the action with the highest Q value. As
soon as one action has positive value associated with it, it will be chosen from then
on, to the exclusion of the other actions. There are two simple solutions to this
problem. One is to perform random actions with a certain small probability. This
guarantees that the whole space will eventually be explored, but can take a long
time. Also, even if the best states are eventually reached, if they occur only rarely,
they may not have a significant effect on the Q values. Another solution is to set
the  initial  Q  values  to  be  higher  than  any  of  the  actual  Q  values.  This  causes  a 
process  similar  to  the  operation  of  the  IE  algorithm,  in  which  the  actions  are  chosen 
alternately  until  the  Q  values  are  driven  down  to  the  actual  action  values.  If  the 
initial  Q  values  are  much  too  high,  however,  this  process  can  take  a  long  time;  it 
is  effective  only  if  a  relatively  tight  upper  bound  on  the  action  values  is  known  a 
priori. 
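The effect of the initial values can be seen in a deliberately tiny example. The sketch below assumes a hypothetical one-state world with two actions whose reinforcements are fixed at 0.1 and 1.0; these numbers, and the function name, are illustrative assumptions rather than an environment from the text.

```python
def run_greedy_q(initial, rewards=(0.1, 1.0), alpha=0.5, gamma=0.9, ticks=500):
    """Greedy Q-learning in a one-state world; returns final Q values and try counts."""
    q = [initial] * len(rewards)
    tries = [0] * len(rewards)
    for _ in range(ticks):
        a = max(range(len(q)), key=lambda k: q[k])   # always pick the highest Q value
        tries[a] += 1
        target = rewards[a] + gamma * max(q)         # r + gamma * U(next state)
        q[a] = (1 - alpha) * q[a] + alpha * target
    return q, tries
```

With `initial=0.0` the first action tried acquires a positive value and the other is never sampled. With `initial=20.0`, which exceeds the largest true action value of 1/(1 - 0.9) = 10, both actions are tried until their estimates are driven down, after which the better action is preferred.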

As  the  agent  gains  experience  in  the  world,  the  Q  values  begin  to  become  true 
reflections  of  the  action-values  of  the  states  in  the  world,  given  that  the  optimal 
policy  is  being  executed.  Watkins  proved  that,  in  fact,  the  Q  values  will  converge 
to  the  values  of  the  actions  under  the  optimal  policy  given,  among  other  conditions, 
that  each  input-action  pair  is  experienced  an  infinite  number  of  times. 


9.2  Q-Learning  and  Interval  Estimation 

The Q algorithm, as presented above, does not guarantee that each input-action pair
will be sampled an infinite number of times. It is often the case that a particular
action has a high Q value in a given state early on and other actions in that state are
rarely, if ever, tried again. One approach to solving this problem (although it still
does not guarantee convergence) is to apply the basic idea of interval estimation,
choosing the action with the highest upper bound on the underlying Q value. This
approach is embodied in the IEQ algorithm, shown in Figure 63.

Algorithm 17 (IEQ) The initial state is an array indexed by the set of input
states and the set of actions, whose elements are initial states of a normal or
non-parametric central-value estimator.

    u(s, i, a, r) =  s[i', a'] := update-stats(s[i', a'], r + γ U(i))
    e(s, i)       =  a such that the upper bound of s[i, a] is maximized

where 0 < δ < 1, 0 < γ < 1, and U(i) = max_a {er(s[i, a])} (er is the expected
reinforcement of performing action a in state i).

Figure 63: The IEQ algorithm.

This  algorithm  can  use  either  a  normal  or  non-parametric  model  to  estimate  the 
expected  action  values.  Using  the  normal  distribution  as  a  model  can  be  dangerous, 
however,  because  at  the  beginning  of  this  process,  the  sample  variance  is  often 
0,  which  causes  the  confidence  intervals  to  be  degenerate.  The  normal  and  non- 
parametric  methods  for  generating  confidence  intervals  were  informally  discussed  in 
Section  4.5.2  and  are  presented  in  detail  in  Appendix  A. 

The  function  U  changes  over  time,  making  early  reinforcement  values  no  longer 
representative  of  the  current  value  of  a  particular  action.  This  problem  is  already 
dealt  with,  in  part,  by  the  nature  of  the  bounded-space  non-parametric  techniques, 
because  only  a  sliding  window  of  data  is  kept  and  used  to  generate  upper  bounds. 
However,  this  does  not  guarantee  that  poor-looking  actions  will  be  taken  periodically 
in  order  to  see  if  they  have  improved.  One  way  of  doing  this  is  to  decay  the  statistics, 
periodically  dropping  old  measurements  out  of  the  sliding  windows,  making  them 
smaller.  A  similar  decay  process  can  be  used  in  the  normal  statistical  model,  as  well. 
Decaying  the  statistics  will  have  the  effect  of  increasing  upper  bounds,  eventually 
forcing  the  action  to  be  re-executed.  This  method  will  keep  the  algorithm  from 
absolutely  converging  to  the  optimal  policy,  but  the  optimal  policy  can  be  closely 
approximated by decreasing the decay rate over time. The IEQ algorithm has three
parameters: γ, the discount factor; α, the size of the confidence intervals; and δ,
the decay rate.
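A rough sketch of the idea behind IEQ is given below. It is not the estimator of Appendix A: the upper bound here is a simple normal-approximation bound (mean plus z standard errors) over a sliding window, an untried or once-tried action is given an infinite bound so that it is sampled, and the decay of old statistics is represented only by the bounded window. All names and these choices are assumptions made for illustration.

```python
import math
from collections import defaultdict, deque

class IEQ:
    """Interval-estimation variant of Q-learning (illustrative sketch only)."""

    def __init__(self, actions, gamma=0.9, z=1.96, window=30):
        self.actions = list(actions)
        self.gamma = gamma          # discount factor
        self.z = z                  # width of the confidence interval (assumed form)
        self.stats = defaultdict(lambda: deque(maxlen=window))
        self.prev = None            # (i', a') from the previous tick

    def er(self, i, a):
        """Expected reinforcement: mean of the stored samples (0 if none yet)."""
        data = self.stats[(i, a)]
        return sum(data) / len(data) if data else 0.0

    def ub(self, i, a):
        """Upper confidence bound; infinite until two samples are available."""
        data = self.stats[(i, a)]
        n = len(data)
        if n < 2:
            return math.inf
        mean = sum(data) / n
        var = sum((x - mean) ** 2 for x in data) / (n - 1)
        return mean + self.z * math.sqrt(var / n)

    def U(self, i):
        return max(self.er(i, a) for a in self.actions)

    def step(self, i, r):
        """Record r + gamma * U(i) for the previous pair, then pick the highest bound."""
        if self.prev is not None:
            self.stats[self.prev].append(r + self.gamma * self.U(i))
        a = max(self.actions, key=lambda act: self.ub(i, act))
        self.prev = (i, a)
        return a
```

Note that a sample variance of 0 makes the bound collapse to the mean, which is the degenerate-interval hazard of the normal model mentioned above.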

The  biggest  practical  improvement  of  IEQ  over  Q  is  that  it  is  no  longer  necessary 
to  estimate  the  values  of  the  states  in  order  to  generate  appropriate  initial  values. 
In  the  context  of  the  Dyna  architecture  [72],  Sutton  has  recently  developed  a  similar 
extension  to  Q-learning,  called  Dyna-Q+,  in  which  a  factor  measuring  uncertainty 
about  the  results  of  actions  is  added  to  the  Q  values,  giving  a  bonus  to  exploring 
actions  about  which  little  is  known. 


9.3  Adaptive  Heuristic  Critic  Method 

Sutton [70,71] has developed a different approach to applying the temporal difference
method to learning from delayed reinforcement. Rather than learning the value of
every action in every input state, the adaptive heuristic critic (AHC) method learns
an  evaluation  function  that  maps  input  states  into  their  expected  discounted  future 
reinforcement  values  given  that  the  agent  executes  the  policy  it  has  been  executing. 
One  way  of  viewing  this  method  is  that  the  AHC  module  is  learning  to  transduce 
the  delayed  reinforcement  signal  into  a  local  reinforcement  signal  that  can  be  used 
by  any  of  the  algorithms  of  the  previous  chapters.  The  algorithm  used  to  learn 
from  the  local  reinforcement  signal  need  only  optimize  the  reinforcement  received 
on  the  next  tick;  such  an  algorithm  is  referred  to  as  a  local  (as  opposed  to  global) 
learning  algorithm.  It  is  a  requirement,  however,  that  the  local  learning  algorithm 
be  capable  of  learning  in  nonstationary  environments,  because  the  AHC  module  will 
be  learning  a  transduction  that  changes  as  the  agent’s  policy  changes. 

The AHC method, in combined operation with an algorithm for learning from
local reinforcement, is formally described in Figure 64. There are two components
to the state of the AHC algorithm: the vectors v and c. The v vector contains, at
every tick, the current best estimate of the discounted future value of each state with
discount rate γ, given that the agent is executing the behavior that it is currently
executing. The c vector values represent the “activation” values of the states. States
that have been visited recently have high activation values and those that have not
been visited recently have low values. Each of these vectors is initialized to contain
0 values.

Algorithm 18 (AHC) The initial state, s_0, consists of three parts: two
n-dimensional vectors, c and v, and s_l, the initial state of the local learning algorithm.

    u(s, i, a, r) =  for j := 0 to n do
                         c[j] := γ λ c[j]
                     c[i'] := c[i'] + 1
                     vi := v[i];  vi' := v[i']
                     for j := 0 to n do
                         v[j] := v[j] + α c[j] (r'' + γ vi - vi')
                     s_l := u_l(s_l, i'', a'', v[i'])
    e(s, i)       =  e_l(s_l, i)

where i' and a' are the input and action values from tick t - 1; i'', a'', and r'' are
from tick t - 2; n is the size of the input set; s_l, u_l and e_l are the internal state,
the update function, and the evaluation function of the local learner; 0 < λ < 1;
0 < γ < 1; and 0 < α < 1.

Figure 64: The AHC algorithm.

The update function first updates the activation values. Each element’s
activation is multiplied by λγ, where γ is the discounting rate and λ is an independent
factor that controls the degree to which activation is spread backward from the
currently active state. Then, the activation of the state whose value is being
updated on this tick, state i', is increased by 1. The values of states are adjusted in
proportion to their activations, so for λ = 0, only the currently active state’s value
is updated on each tick.

Next, the state values in vector v are updated. Each value v[j] is incremented by
the product of its activation, c[j], the learning rate, α, and the prediction difference,
r'' + γv[i] - v[i']. The quantity v[i'] is the estimated value of state i'. The quantity
r'' + γv[i] is a one-step lookahead value of state i', computed as the sum of the global
value of state i' (as indicated by the reinforcement value r'' of the previous tick) and
the discounted value of the next state, γv[i]. Since the one-step lookahead value is
a better estimate than the stored value, the difference between the two values can
be used as an error signal for updating the stored value. This updating method
efficiently propagates global reinforcement values back along the chain of actions
that lead to them, making the AHC algorithm another instance of the temporal
difference method.
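The activation and value updates just described can be sketched as a single function over two lists. This is a minimal sketch of the critic only: the local learner is omitted, the function and parameter names are assumptions, and the caller is assumed to supply the reinforcement that the algorithm associates with the transition (the r'' of Figure 64), so the tick bookkeeping is left out.

```python
def ahc_update(v, c, i_prev, i_now, r, alpha=0.1, gamma=0.9, lam=0.5):
    """Update state values v and activations c in place; return the prediction difference."""
    # Decay every activation, then boost the state whose value is being updated.
    for j in range(len(c)):
        c[j] *= gamma * lam
    c[i_prev] += 1.0

    # One-step lookahead value minus stored value, computed once before any v changes.
    error = r + gamma * v[i_now] - v[i_prev]

    # Adjust each state's value in proportion to its activation.
    for j in range(len(v)):
        v[j] += alpha * c[j] * error
    return error
```

With `lam=0.0` every activation other than that of `i_prev` is zeroed on each call, so only that state's value changes, as noted above.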

Finally,  the  update  function  feeds  a  learning  instance  to  the  update  function 
of  the  local  learning  algorithm.  The  reason  for  updating  the  local  learner  two 
ticks  behind  is  that  if  a  large  reinforcement  value  is  received,  we  would  like  it  to 
be  reflected  in  the  function  learner  as  soon  as  possible.  However,  if  a  large  r  is 
received  at  time  t,  it  takes  two  more  ticks  to  receive  the  data  that  will  allow  its 
effect on v to be calculated. The algorithm would not be incorrect if it performed
s_l := u_l(s_l, i', a', v[i]) instead, but it would not respond to good or bad results the
first time they were encountered.

The  AHC  algorithm  has  no  effect  on  the  evaluation  process  and  simply  calls  the 
evaluation  method  of  the  local  learning  algorithm. 

Sutton  has  shown  [71]  that,  for  the  non-discounted  case,  the  expected  values 
of  the  predictions  found  by  the  temporal  difference  method  converge  to  the  ideal 
predictions  if  the  data  sequences  are  generated  by  Markov  processes  and  the  value 
of parameter λ equals 0. When λ = 1, the temporal difference method generates
the  same  weight  adjustments  as  the  Widrow-Hoff  rule.  Of  course,  when  the  agent  is 
choosing  actions  that  change  the  state  of  the  world,  the  distributions  of  input  data 
change  and  these  results  do  not  necessarily  hold. 

Sutton’s  presentation  of  the  AHC  algorithm  was  combined  with  a  version  of  the 
LARC  algorithm  for  local  learning.  The  AHC  method  is  presented  here  independent 
of  assumptions  about  the  local  learning  algorithm.  This  way  of  breaking  down 
the  problem  is  very  useful,  because  it  allows  us  to  independently  choose  a  local 
reinforcement-learning  algorithm  that  is  appropriate  for  the  sorts  of  environments 
in  which  it  will  be  run  for  use  in  combination  with  the  AHC  algorithm.  In  addition, 
Sutton used linear association methods to store the values of v and c more efficiently.
In this version, the activation and state values are simply stored in a table, but it
is easy to see how a variety of more efficient (if less precise) associative storage
methods could be applied.

There  have  been  a  number  of  implementations  of  temporal  difference  algorithms 
similar  to  AHC,  but  none  have  had  a  correct  analysis  of  convergence  results.  The 
AHC work grew out of the adaptive critic element (ACE) used by Barto, Sutton, and
Anderson  [11]. 

Witten’s  [86]  adaptive  optimal  controller  algorithm  computes  state  values  as  in 
the  AHC  algorithm,  but  differs  from  Sutton’s  work  in  the  way  it  is  combined  with 
the  local  learner.  This  difference  causes  its  performance  to  be  significantly  inferior 
[70]. 

One  of  AI’s  most  striking  early  successes  was  Samuel’s  checkers-playing  program 
[60,61]. In one of its learning modes, it learned an evaluation function for board
positions from reinforcement. Although Samuel’s learning procedure is very complex,
it can be closely approximated by the AHC algorithm with γ = 1.

Another,  more  distantly  related,  learning  method  is  Holland’s  bucket  brigade 
method  for  assigning  credit  to  chains  of  rules  firing  in  a  production  system  [33]. 
It  differs  significantly  in  the  details,  but  shares  the  temporal-difference  notion  of 
assigning  credit  along  a  sequence  based  on  the  local  predicted  improvement  rather 
than  waiting  for  global  reinforcement. 


9.4  Other  approaches 

There have been a number of other approaches to learning from delayed
reinforcement. They can be divided into those that learn a world model (generally assuming,
unlike  Rivest  and  Schapire  [56],  that  there  is  no  hidden  state)  and  those  that  do 
not. 

Drescher  [18]  presents  a  theory  and  implementation  of  learning  based  on  the 
developmental  psychology  of  Piaget.  The  agent  learns  precondition-action-result 
schemata  that  allow  it  to  achieve  dynamically  presented  goals.  Drescher’s  methods 
have  been  demonstrated  in  a  simple  deterministic  world  with  hidden  state.  There 
have  been  a  number  of  other  efforts  to  learn  world  models.  These  include  the  work 
of Sutton and Pinette [73], Mason, Christiansen, and Mitchell [40], Mel [42], and
Shen  [68]. 

There  has  been  a  series  of  attempts  to  solve  the  pole-balancing  problem  using 
reinforcement.  The  problem  is  motivated  by  a  physical  system  in  which  a  pole 
is  flexibly  mounted  on  a  cart.  The  pole  can  rotate  about  its  connection  to  the 
cart  in  one  dimension,  and  the  cart  can  move  along  a  one-dimensional  track  (in 
the same dimension as the plane in which the pole moves). The goal is to control
the  cart  in  such  a  way  as  to  keep  the  pole  from  falling  over  and  to  keep  the  cart 
from  reaching  either  end  of  its  track.  The  system  is  given  an  encoding  of  the 
positions  and  velocities  of  the  angle  of  the  pole  with  respect  to  the  cart  and  the 
offset  of  the  cart  with  respect  to  the  midpoint  of  the  track,  and  the  system  chooses 
between  applying  a  fixed-magnitude  force  on  the  cart  in  either  a  positive  or  negative 
direction.  Negative  reinforcement  is  received  whenever  the  pole  falls  over  or  the  cart 
reaches  the  end  of  its  track.  The  system  must  learn  a  “bang-bang”  control  law  that 
maximizes  reinforcement  by  keeping  the  pole  up  and  the  cart  within  limits  for  as 
long  as  possible. 

The  first  learning  solution  to  this  problem  was  the  BOXES  system  of  Michie  and 
Chambers  [44].  It  was  so  named  because  of  the  quantization  of  the  four-dimensional 
continuous-valued  parameter  space  into  a  set  of  255  regions  or  “boxes.”  Each  box 
was  viewed  as  making  a  separate  decision  about  whether  to  generate  a  “left”  or 
“right”  action  when  the  system  was  in  that  state,  based  on  the  expected  run  length 
given  each  choice  of  action.  Learning  only  took  place  after  a  failure,  and  each  policy 
was  tested  for  an  entire  run.  The  details  of  the  method  are  complex  and  somewhat 
ad hoc, but it recognizes the interesting issues of the problem setting, including
temporal  credit  assignment  and  the  tradeoff  between  acting  to  gain  information 
and  acting  to  gain  reinforcement. 

Connell  and  Utgoff’s  CART  system  [17]  takes  advantage  of  the  continuity  of 
the  parameter  space,  using  an  algorithm  that  does  not  make  an  a  priori  division 
of  the  space  into  discrete  boxes.  Points  in  the  state  space  are  determined  from 
experience  to  be  either  desirable  or  not  desirable — interpolation  is  used  to  determine 
the  desirability  of  states  that  have  not  yet  been  visited.  The  system  has  considerably 
better performance than either the BOXES system or the application of the AHC
algorithm to this problem by Selfridge and Sutton [67] or by Anderson [3,4]. The
difference  in  performance  seems  principally  to  depend  on  differences  in  the  encodings 
of  the  inputs,  however. 


9.5  Complexity  Issues 

Whether  we  are  learning  action  values  or  an  evaluation  function,  we  are  confronted 
again  with  the  problem  of  high  computational  complexity. 

With  the  Q  and  IEQ  algorithms,  we  are  back  again  to  the  kinds  of  exponential 
complexity  in  the  size  of  the  input  and  output  that  we  have  been  trying  to  avoid. 
Watkins addresses this issue for Q-learning by using Albus’ CMAC method [2] for
associating Q values with input-action pairs for its “computational speed and
simplicity, rather than accuracy or storage economy.” It is possible to use a CMAC that
is very space efficient, but at a potentially great cost in accuracy.

Another method of improving computational complexity at the expense of
accuracy is to use a linear associator to store the values being learned. The Q values
could be stored as a function of a bit vector constructed by concatenating the
bit-vector encodings of the input state and the action. Sutton uses this method in
his implementation of AHC, storing the evaluations of input states as functions of
bit-vector encodings of those states. It is difficult to quantify exactly how much
expressive power is lost by using such methods and how that loss in expressiveness
will impact the performance of the learning methods as a whole. A related method,
used by Anderson [3], is to store predictions in a multi-layer network trained
using the error-backpropagation method (Section 3.4.3 describes this method in more
detail).
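As a sketch of the linear-associator idea, the code below stores a value as a weighted sum of the concatenated input and action bits and trains the weights with a Widrow-Hoff (delta-rule) step. The class name, the extra constant bias bit, and the learning rate are assumptions made for this illustration; they are not taken from Sutton's implementation.

```python
def encode(input_bits, action_bits):
    """Concatenate the input and action encodings, plus a constant bias bit (assumed)."""
    return list(input_bits) + list(action_bits) + [1]

class LinearQ:
    """Stores values as a linear function of a bit vector instead of a full table."""

    def __init__(self, n_features, rate=0.1):
        self.w = [0.0] * n_features
        self.rate = rate

    def predict(self, x):
        return sum(w * xi for w, xi in zip(self.w, x))

    def train(self, x, target):
        """Delta-rule step toward the target value; returns the error before the step."""
        err = target - self.predict(x)
        for k, xi in enumerate(x):
            self.w[k] += self.rate * err * xi
        return err
```

The number of weights grows linearly with the number of input and action bits, whereas a table needs one entry per input-action pair. The cost is that only values that are linear in the bits can be represented exactly.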

Algorithms,  such  as  IEQ,  that  must  associate  a  whole  collection  of  data  with  an 
input-action  pair  are  harder  to  make  more  efficient  in  this  way. 
Figure 65: Environment D1: a very simple delayed-reinforcement environment.


Figure  66:  Environment  D2:  a  more  difficult  delayed-reinforcement  environment. 

9.6  Empirical  Comparison 

This  section  describes  the  results  of  three  different  methods  of  learning  from  delayed 
reinforcement  in  three  simple  simulated  environments. 


9.6.1  Environments 

The first two environments are taken from Sutton’s thesis [70]. Figures 65 and 66
show their state-transition diagrams. The circled numbers are the reinforcement
values of the states; most of the states have reinforcement value 0 (which is omitted
from the figure). The first is a very easy deterministic environment. The second is a
considerably more difficult non-deterministic environment, with little differentiation
between “good” and “bad” actions. The third environment, from Watkins [78], is
shown in Figure 67. It was constructed to be misleading, because, although the
correct action in state 0 is 0, if the agent is executing a random policy, the action 1
will have a higher value. Before we apply the learning algorithms to these domains,
it is interesting to consider the values of the states and the expected reinforcement
of acting optimally in each case.

Figure 67: Environment D3: a highly misleading delayed-reinforcement environment.

The  optimal  strategy  for  environment  D1  is,  obviously,  always  to  execute  action 
1.  Because  the  world  is  deterministic,  it  will  take  five  steps  to  get  payoff  1,  so  the 
average  reinforcement  of  the  optimal  policy  is  0.2.  The  values  of  the  states  can  be 
calculated  by  solving  the  following  set  of  equations,  which  specify  the  value  of  each 
state  in  terms  of  its  global  value  and  the  discounted  value  of  its  successor  under  the 
optimal  policy: 

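\[
\begin{aligned}
v_0 &= 1 + \gamma v_1 \\
v_1 &= \gamma v_2 \\
v_2 &= \gamma v_3 \\
v_3 &= \gamma v_4 \\
v_4 &= \gamma v_0
\end{aligned}
\]

The solution to the equations is

\[
\begin{aligned}
v_0 &= 1/(1-\gamma^5) \\
v_1 &= \gamma^4/(1-\gamma^5) \\
v_2 &= \gamma^3/(1-\gamma^5) \\
v_3 &= \gamma^2/(1-\gamma^5) \\
v_4 &= \gamma/(1-\gamma^5)
\end{aligned}
\]

which, for $\gamma = .9$, yields the following values: $v_0 = 2.44$, $v_1 = 1.60$,
$v_2 = 1.78$, $v_3 = 1.98$, $v_4 = 2.20$.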

The second automaton, D2, is non-deterministic. In this case, the optimal strategy
is also always to execute action 1. The expected number of failures preceding
the first success in a sequence of Bernoulli trials with probability $p$ is $(1 - p)/p$, so
we expect to remain in each of states 1 through 4 for an average of 1 + 0.4/0.6 = 1.67
steps when executing the optimal policy. Thus, the total expected round-trip time
is 4 × 1.67 + 1 = 7.67, making the expected reinforcement per tick approximately
equal to 0.13. The action values are the solution to the equations

\[
\begin{aligned}
v_0 &= 1 + \gamma v_1 \\
v_1 &= \gamma(.4 v_1 + .6 v_2) \\
v_2 &= \gamma(.4 v_2 + .6 v_3) \\
v_3 &= \gamma(.4 v_3 + .6 v_4) \\
v_4 &= \gamma(.4 v_4 + .6 v_0)
\end{aligned}
\]

which, for $\gamma = .9$, is $v_0 = 1.84$, $v_1 = 0.93$, $v_2 = 1.10$, $v_3 = 1.31$, $v_4 = 1.55$.

Finally,  for  the  complex  automaton  D3,  the  optimal  strategy  is  to  take  action  0 
in  state  0  and  action  1  in  states  5,  6  and  7.  This  path  through  the  transition  graph 
takes  5  steps  to  gain  reinforcement  value  2,  yielding  an  average  reinforcement  per 
tick  of  0.4.  The  values  of  the  states  under  the  optimal  strategy  can  be  expressed  as 


\[
\begin{aligned}
v_0 &= \gamma v_5 \\
v_1 &= \gamma v_2 \\
v_2 &= \gamma v_3 \\
v_3 &= \gamma v_4 \\
v_4 &= 1 + \gamma v_0 \\
v_5 &= \gamma v_6 \\
v_6 &= \gamma v_7 \\
v_7 &= \gamma v_8 \\
v_8 &= 2 + \gamma v_0
\end{aligned}
\]

Solving these equations with $\gamma = .9$ yields the state values $v_0 = 3.20$, $v_1 = 2.83$,
$v_2 = 3.15$, $v_3 = 3.50$, $v_4 = 3.88$, $v_5 = 3.56$, $v_6 = 3.96$, $v_7 = 4.40$, $v_8 = 4.88$.
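The state values quoted for D1, D2, and D3 can be checked numerically. The following
is a minimal sketch, assuming the transition structures implied by the three sets of
equations; the function name `evaluate_policy` and the dictionary encoding of the
transitions are illustrative choices, not part of the original experiments. It
repeatedly applies the update v ← r + γPv, which converges because γ < 1.

```python
def evaluate_policy(rewards, transitions, gamma=0.9, sweeps=2000):
    """Fixed-point iteration on v = r + gamma * P v for a fixed policy.

    rewards[s] is the reinforcement value of state s; transitions[s] maps
    each successor state to its probability under the policy.
    (Hypothetical encoding, chosen for this sketch.)
    """
    v = [0.0] * len(rewards)
    for _ in range(sweeps):
        # The comprehension reads the old v; the new list replaces it afterwards.
        v = [rewards[s] + gamma * sum(p * v[t] for t, p in transitions[s].items())
             for s in range(len(rewards))]
    return v


# D1: deterministic five-state loop, reinforcement 1 in state 0.
d1 = evaluate_policy([1, 0, 0, 0, 0],
                     [{1: 1.0}, {2: 1.0}, {3: 1.0}, {4: 1.0}, {0: 1.0}])

# D2: each of states 1..4 repeats with probability .4 and advances with .6.
d2 = evaluate_policy([1, 0, 0, 0, 0],
                     [{1: 1.0}, {1: 0.4, 2: 0.6}, {2: 0.4, 3: 0.6},
                      {3: 0.4, 4: 0.6}, {4: 0.4, 0: 0.6}])

# D3 under the optimal policy: 0 -> 5 -> 6 -> 7 -> 8 -> 0, plus the chain 1..4 -> 0.
d3 = evaluate_policy([0, 0, 0, 0, 1, 0, 0, 0, 2],
                     [{5: 1.0}, {2: 1.0}, {3: 1.0}, {4: 1.0}, {0: 1.0},
                      {6: 1.0}, {7: 1.0}, {8: 1.0}, {0: 1.0}])

for name, values in (("D1", d1), ("D2", d2), ("D3", d3)):
    print(name, [round(x, 2) for x in values])
```

Iterating rather than inverting a matrix keeps the sketch free of external
libraries; with nine states or fewer, a few thousand sweeps are far more than needed.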


9.6.2  Algorithms 

The  following  three  algorithms  for  learning  from  delayed  reinforcement  were  tested 
on  each  of  these  problems: 

•  Q  (described  in  Figure  62) 

•  IEQ  (described  in  Figure  63) 

•  AHC  (described  in  Figure  64)  in  combination  with  a  version  of  the  IE  algorithm 
(described  in  Figure  21)  that  uses  normal  statistics  and  is  modified  for  use  in 
non-stationary  environments. 

It  would  have  been  appropriate  to  compare  Anderson’s  combined  back-propagation 
and  AHC  method  with  these  algorithms,  but  the  parameter  tuning  problem  for  that 
algorithm  seems  computationally  impractical. 

9.6.3  Parameter  Tuning 

Each of these algorithms has a number of parameters. Algorithm Q has parameters
$\alpha$ and $\gamma$; IEQ has parameters $\alpha_{IE}$,¹ $\gamma$, and $\delta$; AHC has parameters $\alpha$, $\gamma$, and $\lambda$; and
IE with normal nonstationary statistics has parameters $\alpha_{IE}$ and $\delta$. The parameter
$\gamma$ is part of the specification of the correctness criterion, and it will be set to 0.9
for each algorithm and task. To illustrate the dependence of the Q algorithm on
its initial value, two versions of Q will be tested: one with initial values equal to 0
(which is below the action values in all cases) and one with initial values equal to
20 (which is well above the action values in all cases). These two algorithms will be
referred to as Q0 and Q20.

¹ Because we are using statistics for the normal distribution, it is easier to express the size of the
confidence intervals in terms of $\alpha$ rather than $z_{\alpha/2}$; these are simply two ways of specifying the same
parameter.

ALG-TASK                      D1       D2       D3

Q           $\alpha$          .95      .95      .95

IEQ         $\alpha_{IE}$     .01      .05      .001
            $\delta$          .999     .9999    .99

AHC + IE    $\alpha$          .15      .1       .5
            $\lambda$         .1       .2       1.0
            $\alpha_{IE}$     .05      .05      .001
            $\delta$          .9999    .99      .99

Table 9: Best parameter values for each algorithm in environments D1, D2, and
D3.

For each algorithm and environment, a series of 100 trials of length 3000 was
run with different parameter values. Table 9 shows the best set of parameter values
found for each algorithm-environment pair. The parameter $\alpha$ for the Q algorithms
is largely irrelevant: if the initial value is too small, no value of $\alpha$ will result in good
performance; if the initial value is large, $\alpha$ should be as large as possible.
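The sensitivity to initial values can be illustrated with a small sketch. It rests on
several assumptions that are not taken from the experiments: the environment is a
hypothetical five-state ring (not D1, D2, or D3), the learner uses the standard
one-step Q-learning update, and action choice is purely greedy with ties broken
toward action 0. The function name `run_greedy_q` is likewise illustrative.

```python
def run_greedy_q(initial_value, alpha=0.95, gamma=0.9, steps=3000):
    """Average reinforcement per tick of greedy one-step Q-learning.

    Hypothetical environment (not D1-D3): states 0..4 in a ring. Action 1
    advances to the next state, action 0 stays put, and reinforcement 1 is
    received on the step that enters state 0 from state 4.
    """
    n_states = 5
    q = [[float(initial_value)] * 2 for _ in range(n_states)]
    state, total = 1, 0.0
    for _ in range(steps):
        # Purely greedy choice; ties go to action 0 (an assumption of this sketch).
        action = 1 if q[state][1] > q[state][0] else 0
        if action == 1:
            next_state = (state + 1) % n_states
            reward = 1.0 if next_state == 0 else 0.0
        else:
            next_state, reward = state, 0.0
        # One-step Q-learning update toward reward plus discounted best successor value.
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state, total = next_state, total + reward
    return total / steps


q0_average = run_greedy_q(initial_value=0)
q20_average = run_greedy_q(initial_value=20)
print(q0_average, q20_average)
```

Under these assumptions the zero-initialized learner never leaves its start state,
since no update ever changes its estimates, whereas the optimistic learner tries
each action in turn as its overestimates decay and therefore reaches the rewarding
transition.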

9.6.4  Results 

Using the best parameter values for each algorithm and environment, the
performance of the algorithms was compared on 100 runs of length 3000. The performance
metric was average reinforcement per tick, averaged over the entire run. The
results are shown in Table 10, together with the expected reinforcement of executing
a completely random behavior (choosing actions 0 and 1 with equal probability)
and of executing the optimal behavior.

ALG-TASK       D1       D2       D3

Q0             .0000    .0910    .0000
Q20            .1907    .1222    .3780
IEQ            .1959    .1222    .2315
AHC + IE       .1988    .1153    .2923

random         .1100    .1100    .1250
optimal        .2000    .1300    .4000

Table 10: Average reinforcement for tasks D1, D2, and D3 over 100 runs of length
3000.

Figure 68: Significant dominance partial order among delayed-reinforcement
algorithms for each task (panels: Task D1, Task D2, Task D3).

As in the previous sets of experiments, we must examine the relationships of
statistically significant dominance among the algorithms for each task. Figure 68
shows, for each task, a pictorial representation of the results of a 1-sided t-test
applied to each pair of experimental results. The graphs encode a partial order of
significant dominance, with solid lines representing significance at the .95 level.
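A pairwise dominance test of this kind might be sketched as follows. The text does
not say which variant of the t-test was used, so the unequal-variance (Welch) form
here is an assumption, as are the function names and the use of the one-sided normal
critical value 1.645 as a stand-in for the t critical value at roughly 100 runs per
sample.

```python
import math
import statistics


def one_sided_t(sample_a, sample_b):
    """Welch t statistic for the hypothesis that mean(a) > mean(b).

    The unequal-variance form is an assumption of this sketch.
    """
    mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    std_err = math.sqrt(statistics.variance(sample_a) / len(sample_a)
                        + statistics.variance(sample_b) / len(sample_b))
    return mean_diff / std_err


def dominates(sample_a, sample_b, critical_value=1.645):
    """True if a is significantly better than b at roughly the .95 level."""
    return one_sided_t(sample_a, sample_b) > critical_value
```

Applying `dominates` to every ordered pair of algorithms on a task gives the edges
of a dominance graph; each sample would hold the per-run average reinforcement of
one algorithm.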

With  the  best  parameter  values  for  each  algorithm,  it  is  also  instructive  to 
compare  the  rate  at  which  performance  improves  as  a  function  of  the  number  of 
training  instances.  Figures  69,  70,  and  71  show  superimposed  plots  of  the  learning 
curves  for  each  of  the  algorithms.  Each  point  represents  the  average  reinforcement 
received  over  a  sequence  of  100  steps,  averaged  over  100  runs  of  length  3000. 
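The construction of each learning-curve point can be sketched as below. This is an
illustrative helper, not code from the experiments; `learning_curve` and its
arguments are hypothetical names, and each run is assumed to be stored as a list of
per-tick reinforcement values.

```python
def learning_curve(runs, block=100):
    """Average reinforcement per block of ticks, averaged across runs.

    runs is a list of equal-length reinforcement sequences, one per run.
    """
    length = len(runs[0])
    curve = []
    for start in range(0, length - block + 1, block):
        # Mean reinforcement of each run within this block of ticks.
        block_means = [sum(run[start:start + block]) / block for run in runs]
        curve.append(sum(block_means) / len(runs))
    return curve
```

With runs of length 3000 and blocks of 100 ticks, this yields 30 points per curve.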


9.6.5  Discussion 

There are no clear winners among this set of algorithms. On the simple deterministic
task D1, all of the algorithms approach the optimal performance level very closely.
It takes the longest for Q20 to improve; if the initial values were smaller it would
converge faster. When the initial value is too small, as in Q0, the algorithm performs
significantly worse than random.

Figure 69: Learning curves for Task D1.

Figure 70: Learning curves for Task D2.

Figure 71: Learning curves for Task D3.

The non-deterministic task D2 is very difficult because of the similarity in
transition probabilities between the two actions in each state. On this task, algorithms
Q20 and IEQ perform essentially equivalently, approaching but not achieving optimal
performance. The AHC+IE algorithm performs very poorly at first, but suddenly
“realizes” the right course of action (perhaps when the AHC component has seen
the higher-numbered states enough to realize that they are significantly better and
the old statistics have decayed sufficiently in the IE component) and begins to
perform as well as the other two algorithms. As usual, Q0 performs far worse than the
random strategy.

Performance on the difficult problem of task D3 hinges on persistently trying,
for a while, courses of action that appear bad. This persistence is necessary to
discover that the left loop of the graph is better if the proper action strategy is
known. The Q20 algorithm does a good job of this, and is the only one of the
algorithms to achieve optimal performance during the course of a 3000-tick run.
The other algorithms improve over time, but not nearly as fast. The fact that their
performance rises above the .2 level (which is achieved by going around the right
loop of the graph) indicates that they are discovering the left loop of the graph. The
Q0 algorithm performs as badly as possible, probably by looping between states 0
and 5.

More  extensive  experiments  will  be  required  before  it  is  possible  to  formulate 
general  rules  of  applicability  of  these  algorithms  to  specific  learning  tasks. 


Chapter  10 


Experiments  in  Complex  Domains 


This chapter reports on three experiments comparing algorithms introduced in
previous chapters on more complex domains. The first domain is a simulated one with
a large number of input and output bits, but with a fairly low-complexity function
defining the dependence of each output bit on the input bits. The second domain
is a mobile-robot domain in which the agent learns from local reinforcement. The
third domain is an extension of the mobile robot domain in which the agent learns
from delayed reinforcement. The settings of the experiments will emulate, as much
as possible, the deployment of these learning algorithms in realistic domains.


10.1  Simple,  Large,  Random  Environment 


This task, in its general form, has $M$ input and $M$ output bits. The optimal action
mapping is generated randomly as follows: each output bit is the conjunction or
disjunction of two input bits or their negations. If the agent chooses an action in
agreement with this mapping, it receives reinforcement value 1 with probability $p_1$
and 0 otherwise; if the agent’s action disagrees with the optimal mapping, it receives
reinforcement value 1 with probability $p_2$ and 0 otherwise.

10.1.1 Algorithms

The  following  algorithms  were  tested  in  this  domain: 

•  IE 

•  CASCADE  +  IE 

•  CASCADE  +  GTRL 

The second and third algorithms consist of a set of Boolean-function learners
combined using the CASCADE method. It is expected that the cascade of GTRL
algorithms will be both more computationally efficient and learn more quickly than
the other two algorithms because the functions are not too complex and the
opportunity for generalization is great.

10.1.2  Task 

The algorithms were tested on an instance of the general family of large random
environments with $M = 8$, $p_1 = .8$, and $p_2 = .1$. It would have been desirable to
use an even larger task, but the size of the data structures for $M = 8$ exhausted
the available computational power. Each run of each algorithm was on a newly
generated random task with the parameters described above.
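One way such a random task could be generated is sketched below. The details are
assumptions made for illustration: the two input bits of each output function are
drawn as distinct bits, negations and the choice of conjunction or disjunction are
equiprobable, and an action counts as agreeing with the mapping only when every
output bit matches. The function names and the tuple layout are hypothetical.

```python
import random


def make_random_task(m, rng):
    """Draw one target function per output bit.

    Each output bit is the AND or OR of two input bits, each possibly
    negated. The tuple layout is a choice made for this sketch.
    """
    task = []
    for _ in range(m):
        i, j = rng.sample(range(m), 2)  # two distinct input bits (assumed)
        task.append((rng.choice(("and", "or")), i, rng.random() < 0.5,
                     j, rng.random() < 0.5))
    return task


def optimal_action(task, inputs):
    """Evaluate the target mapping on a tuple of input bits."""
    outputs = []
    for op, i, neg_i, j, neg_j in task:
        a = inputs[i] != neg_i  # XOR with the negation flag
        b = inputs[j] != neg_j
        outputs.append(int(a and b) if op == "and" else int(a or b))
    return tuple(outputs)


def reinforcement(task, inputs, action, rng, p1=0.8, p2=0.1):
    """Return 1 with probability p1 if the action matches the mapping, else with p2."""
    p = p1 if tuple(action) == optimal_action(task, inputs) else p2
    return 1 if rng.random() < p else 0
```

A fresh task per run would then be `make_random_task(8, random.Random(seed))`, with
a different seed for each run.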

10.1.3  Parameter  Settings 

When we wish to use a learning algorithm in a new setting, we will rarely have
the luxury of performing extensive parameter-tuning runs to be sure that we get
the best possible performance out of our algorithms. In this experiment, as well as
in the other two described in this chapter, parameters for the algorithms were
chosen, within reasonable complexity constraints, using intuitions gained from the
results of previous experiments that we have carried out. The parameter settings
were:

IE: $z_{\alpha/2} = 3.0$

CASCADE + IE: $z_{\alpha/2} = 3.0$, $\delta = .9999$

CASCADE + GTRL: $z_{\alpha/2} = 3.0$, $\delta = .9999$, $H$ = ZM, PA = 20, R = 100

All of the confidence-interval parameters are set to 3.0 and the decays are .9999. The
size of the hypothesis lists, $H$, in the GTRL algorithm varies linearly as a function
of the number of input bits. The number of input instances required for promotion
was 20 and new candidates were generated once every 100 ticks.

10.1.4  Results 

Each of the algorithms was run for 10 trials of length 10,000 each. This is a small
fraction of the number of trials that would be required for the agent to try all 512
possible actions in each of 512 possible input situations. The average reinforcement
for each algorithm on this task is


IE  :  .1019 
CASCADE  +  IE  :  .1050 
CASCADE  +  GTRL  :  .1634 

The  cascaded  generate-and-test  algorithm  significantly  outperforms  either  of  the 
other  algorithms,  due  to  its  ability  to  generalize  both  over  the  input  and  output 
sets.  The  learning  curves  for  the  algorithms  are  shown  in  Figure  72.  As  we  can 
see,  the  GTRL  algorithm  improves  in  performance  significantly  more  quickly  than 
the  others. 


Figure 72: Learning curves for large, random environment.


10.2 Mobile Robot Domain

This section describes the application of algorithms from this dissertation to a
mobile-robot learning scenario. There have been very few implementations of
reinforcement-learning algorithms on real robotic hardware. A notable example
is Maes and Brooks’ [39] use of a simple algorithm to learn to coordinate predefined
behaviors on a walking robot. A number of researchers have applied
reinforcement-learning algorithms to simulated robotic domains, such as the cart-pole problem
described in Chapter 3. Franklin [24] used learning-automata techniques and the
$A_{R-P}$ algorithm to learn to adjust the outputs of an existing controller to compensate
for externally applied torques on a simulated robot arm. In addition, there has been
work on learning world models, such as Clocksin and Moore’s [16], Miller’s [46], and
Mel’s [42] work on learning a mapping from joint positions to visual coordinates in
the workspace of a robotic arm [42] and Mason, Christiansen, and Mitchell’s [40]
work on learning the results of using a robotic arm to tip a tray of objects in various
ways.

The robot pictured in Figure 73 was used to validate a variety of
reinforcement-learning algorithms. It has two drive wheels, one on each side, which allow it to
move forward and backward along circular arcs. A set of five “feelers” allows it to
detect obstacles to its front and sides, the round bumper detects contact anywhere
on its perimeter, and four photosensors, facing forward, backward, left, and right,
measure the light levels in each direction.

10.2.1 Algorithms

The  same  algorithms  and  parameter  settings  were  used  in  this  experiment  as  in  the 
previous  one. 

10.2.2  Task 

In this task, the robot is given negative reinforcement, normally distributed with
mean -2 and standard deviation 0.5, whenever the round bumper makes contact with
any physical object. If the bumper is not engaged, the robot is given positive
reinforcement, normally distributed with mean 1 and standard deviation 0.2, whenever
the light in its front sensor gets brighter. If the bumper has not engaged and the
brightness has not increased, it is given “zero” reinforcement, normally distributed
with mean 0 and standard deviation 0.2.
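The reinforcement rule just described can be written down directly. The sketch
below uses the means and standard deviations given in the text; the function and
argument names are hypothetical, and the two Boolean arguments stand in for
whatever sensor processing the robot actually performs.

```python
import random


def reinforcement_distribution(bumper_engaged, front_got_brighter):
    """Mean and standard deviation of the reinforcement for one step."""
    if bumper_engaged:
        return (-2.0, 0.5)   # contact with any physical object
    if front_got_brighter:
        return (1.0, 0.2)    # front light sensor reading increased
    return (0.0, 0.2)        # "zero" reinforcement


def sample_reinforcement(bumper_engaged, front_got_brighter, rng):
    """Draw one normally distributed reinforcement value."""
    mean, std = reinforcement_distribution(bumper_engaged, front_got_brighter)
    return rng.gauss(mean, std)
```

Note that the bumper check takes priority: a step that both bumps an obstacle and
brightens the front sensor is treated as a bump.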

The robot interacts with the world by making fixed-length motions, either
forward or rotating in place to the left or right. The agent gets the following five bits
of input:

Bits  0  and  1:  Which  direction  is  currently  the  brightest?  0  =  front,  1  =  left,  2  = 
right,  3  =  back. 

Bit  2:  Is  the  rightmost  feeler  engaged? 

Bit  3:  Is  the  leftmost  feeler  engaged? 

Bit  4:  Is  (at  least)  one  of  the  middle  three  feelers  engaged? 

The  agent  must  learn  a  mapping  from  this  input  space  to  its  three  actions  that 
maximizes  its  local  reinforcement.  It  develops  a  behavior  that  avoids  bumping  into 
obstacles  and  tends  to  move  toward  the  light. 

10.2.3  Results 

All of the algorithms were run in the real robotic domain, with varying degrees of
success. Ideally, this section would describe a long series of trials of each algorithm
on the real mobile robot. Unfortunately, it is difficult to conduct such trials fairly in
the physical system. The first problem is that a human must intervene whenever the
robot approaches the light source and move the robot to a new location. The second
problem is that it takes a long time to conduct the experiments. The time that it
takes the robot to move greatly dominates the computation time of the learning
algorithms. So, instead of trials on the real robot, we must substitute a simulation
of the robot and its domain described above. The simulation is not of high fidelity,
which causes this to be a substantially different problem than that of running on
the actual robot. Still, it serves as an interesting and slightly complex domain
for testing reinforcement-learning algorithms. Also, the results in the simulated
domain mirror informal impressions of the relative performance of the algorithms
on the actual robot.

ALG                er

IE                 .6439
CASCADE + IE       .6203
CASCADE + GTRL     .4930

random             .3074
optimal            .6695

Table 11: Average reinforcement for simulated mobile robot environment over 100
runs of length 2000.

In the robot simulation, noise is added to the action and perception of the robot.
Each action of the simulated robot is, with probability .1, changed to a randomly
chosen action; each perception of the state of the world is, with probability .1,
changed to a randomly chosen world state. Whenever the robot reaches the light
source in the simulated world, the light is “teleported” to a new randomly chosen
location.
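The noise model can be sketched as a thin wrapper around the agent's interaction
with the simulator. Everything here beyond the two probabilities of .1 is an
assumption: the action names, the representation of a perceived state as an integer
in the range of the five-bit input, and the fact that a randomly chosen replacement
may coincide with the original value.

```python
import random

ACTIONS = ("forward", "left", "right")  # illustrative names for the three motions


def noisy(value, alternatives, rng, noise=0.1):
    """With probability `noise`, replace value by a uniformly random alternative."""
    if rng.random() < noise:
        return rng.choice(alternatives)
    return value


def noisy_step(intended_action, true_state, rng, n_states=32, noise=0.1):
    """Apply the action and perception noise described for the simulation.

    Returns the action actually executed and the state actually perceived.
    n_states=32 assumes the five-bit input encoding of the task.
    """
    executed = noisy(intended_action, ACTIONS, rng, noise)
    perceived = noisy(true_state, range(n_states), rng, noise)
    return executed, perceived
```

Keeping the noise outside both the learner and the simulator makes it easy to run
the same algorithms with the noise level set to zero for comparison.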

The results of running each algorithm for 100 runs of length 2000 are shown in
Table 11. The optimal expected reinforcement value was estimated by running a
hand-crafted non-learning behavior in the environment under the same conditions
as the experimental algorithms. Similarly, the expected reinforcement of a random
strategy was estimated by running a random strategy in the world. All of the
differences in expected reinforcement are significant. There is only a small difference
in performance between the pure IE algorithm and the cascaded version, but the
GTRL algorithm performs markedly worse than either of them. As we can see in the
learning curves, shown in Figure 74, the GTRL algorithm takes longer to converge
to its maximum performance, which is lower than optimal because it is continually
trying new hypotheses.

Figure 74: Learning curves for the simulated mobile robot task.

10.3  Robot  Domain  with  Delayed  Reinforcement 

The previous mobile robot domain can be complicated by giving the robot a large
reinforcement only when it reaches the light source. This problem is considerably
more difficult than other domains used for delayed reinforcement, such as the
cart-pole domain. In the cart-pole domain, the robot receives a large negative
reinforcement value whenever the pole falls over. In the absence of a good control strategy,
the pole will fall over quite readily, giving the learner a lot of good data early on. In
this robot domain, the robot may execute its initial random strategy for a very long
time before it accidentally encounters the light source. Informal experiments with
the real mobile robot were only successful if a human took an active role near the
beginning of the run, putting the robot in situations from which it was relatively
easy to reach the light and, therefore, get useful reinforcement data.¹

This  section  will  report  formal  experiments  carried  out  in  a  simulated  version 
of  the  robotic  domain  with  delayed  reinforcement. 

10.3.1  Algorithms 

This experiment compares the same algorithms as were compared in the experiment
described in Section 9.6: Q, IEQ, and AHC + IE. The parameter settings were:

Q: $\alpha = .95$, init = 20

IEQ: $\alpha_{IE} = .01$, $\delta = .9999$

AHC + IE: $\alpha = .1$, $\lambda = .2$, $\delta = .9999$, $\alpha_{IE} = .05$

10.3.2  Task 

The inputs and outputs available to the agent remain the same as in the local
reinforcement task. The reinforcement generated by the world is, in this domain,
global rather than local. When the agent comes very close to the light source,
it is given reinforcement that is normally distributed with mean 10 and standard
deviation 2.0; when it bumps into an obstacle, it is given reinforcement normally
distributed with mean -2 and standard deviation 0.25; finally, if it neither bumps
into the wall nor comes near the light, it is given reinforcement normally distributed
with mean 0 and standard deviation 0.25. When the light is reached by the robot,
it is randomly moved to a new location.

10.3.3  Results 

The results of running each algorithm for 10 runs of length 10,000 are shown in
Table 12. As before, the optimal expected reinforcement value was estimated by
running a hand-crafted non-learning behavior in the environment under the same
conditions as the experimental algorithms. Similarly, the expected reinforcement of
a random strategy was estimated by running a random strategy in the world. The
performance of AHC + IE was significantly better than that of Q or IEQ, which were
not significantly different from one another. The learning curves for this domain
are shown in Figure 75. The poor performance of the algorithms in this domain
may be somewhat deceiving. In many cases, the learning strategies learned quickly
to perform at near-optimal levels. However, in many other cases, the robot never,
or only late in the run, acquired enough experience with the light source to learn an
appropriate strategy. It is likely that if the runs were another order of magnitude
longer than those reported here, the asymptotic performance of each of the
algorithms would be very high. For this reason, a “shaping” process used early in the
runs would allow the agent to get more useful information and hence improve its
performance. An interesting area for future research would be to formally specify
such shaping processes and characterize their role in expediting learning.

ALG          er

Q            .1634
IEQ          .1828
AHC + IE     .3651

random       .0000
optimal      .8269

Table 12: Average reinforcement for simulated robot domain with delayed
reinforcement over 10 runs of length 10,000.

Figure 75: Learning curves for the simulated delayed-reinforcement mobile robot
task.

¹ This process is an instance of a class of methods for expediting learning that are referred to by
psychologists [32] as “shaping.” Its use in the robot domain described here was suggested by R.
Sutton.


Chapter 11


Conclusion


Simple reinforcement-learning problems can be effectively solved using the
interval-estimation algorithm. It has two serious limitations, however. First, its
computational complexity increases exponentially in the size of the input and output spaces.
Second, it exhibits no generalization across input and output instances.

These problems have been addressed by the use of linear-association and error
back-propagation methods for associative reinforcement-learning. Each of these
methods has its own problems. The linear-association method can only learn action
maps that are in the class of linearly-separable functions. Error back-propagation
methods can, theoretically, learn functions of arbitrary complexity, but they generally
require a large number of presentations of the learning data and are very sensitive
to internal parameter values.

This dissertation has addressed the problem of finding new algorithms for
efficiently learning limited classes of action maps from reinforcement.

The  first  step  was  to  simplify  the  job  of  the  algorithm  designer  by  reducing 
the  problem  of  learning  action  maps  with  many  output  bits  to  the  problem  of 
learning  action  maps  with  single  output  bits.  The  CASCADE  method  implements 
this  problem  reduction,  providing  decreased  time  complexity  and  improved  learning 
rates,  as  well. 

Valiant’s algorithm for learning Boolean functions in k-DNF provided a useful
foundation for creating new reinforcement-learning algorithms. The LARCKDNF and
IEKDNF algorithms integrate the ideas of linear-associative reinforcement-comparison
and of interval-estimation with Valiant’s methods. These new algorithms efficiently
learn action maps in k-DNF: they are more time-efficient than the raw IE
algorithm, require fewer presentations of data than the BP algorithm, and can learn
a larger class of functions than linear-associative approaches.

The GTRL algorithm is also an algorithm for learning Boolean functions from
reinforcement. Its main advantage is that it can learn low-complexity functions
very efficiently; however, by changing internal parameter values, it can be
configured to learn a variety of different classes of functions with different computational
complexities. In addition, its use of internal symbolic representations allows it to
be extended to learn simple sequential networks.

All of this work has only addressed the problem of local learning from
immediate reinforcement. Existing work on temporal difference methods can also be seen
as a problem reduction. It reduces the problem of global learning from delayed
reinforcement to the problem of local learning from non-stationary immediate
reinforcement. This perspective allows TD methods to be integrated with any available
local learning method.

All of these methods can be integrated in various ways, such as using the
CASCADE and TD problem reductions together with the GTRL, LARCKDNF, or IEKDNF
algorithms to construct an algorithm that learns an action mapping with many
output bits from delayed reinforcement. These combined methods have been tested
and shown to work robustly on a physical mobile robot, demonstrating their
applicability to embedded systems in the real world.

The  rest  of  this  chapter  consists  of  two  sections.  The  first  briefly  lists  the 
novel  contributions  of  the  work  described  in  this  dissertation.  The  second  discusses 
directions  for  extending  this  research. 


11.1  Contributions 

The work described in this dissertation has made a number of contributions to
solving the problem of learning in embedded systems. They are summarized below,
organized in the order in which they were presented.

Foundations: The description of the foundations of reinforcement learning
integrates existing work in dynamic programming, learning-automata theory,
statistics, and previous work on the foundations of reinforcement learning
into a general framework for describing learning behaviors and measuring
their success. In addition to making the existing work more accessible to AI
researchers, this formulation makes it easier for researchers to compare their
results directly and to share implementations of learning behaviors and of
simulated environments.

Interval  Estimation  Algorithm:  The  interval  estimation  algorithm  is  a  novel 
extension  of  existing  methods  for  reinforcement  learning  that  is  grounded 
directly  in  statistical  theory.  In  empirical  tests,  it  learns  more  effectively  than 
other  algorithms  of  its  kind.  However,  its  computational  complexity  makes  it 
impractical  for  use  on  large  problems. 

Cascade  Method:  The  cascade  method  of  building  a  reinforcement  learner  with 
many  output  bits  from  a  collection  of  reinforcement  learners  with  one  output 
bit  is  new.  It  has  been  shown  that  if  each  of  the  individual  components  has 
learned  to  perform  the  behavior  that  is  correct  for  it,  the  entire  system  will 
perform  the  behavior  that  is  correct  overall.  The  cascade  method  works  well 
in  empirical  tests,  often  resulting  in  improved  convergence  rates  as  well  as 
lower  time  complexity. 

Reinforcement  Learning  of  fc-DNF:  Two  algorithms  are  presented  that  learn 
Boolean  functions  from  reinforcement,  based  on  Valiant’s  concept  learning 
algorithm  for  concepts  expressible  in  fc-DNF.  One  uses  the  techniques  of  the 
interval  estimation  algorithm,  while  the  other  is  derived  from  Sutton’s  linear- 
association  reinforcement-comparison  algorithm.  They  are  both  computation¬ 
ally  much  more  efficient  than  standard  methods  and  perform  nearly  as  well 
as  standard  methods  on  a  variety  of  tasks. 

Generate-and-Test Reinforcement Learner: The GTRL algorithm is a novel
reinforcement-learning method that uses syntactic search through the space
of Boolean function descriptions to learn single-bit output functions from
reinforcement. It is based on, but has diverged significantly from,
Schlimmer’s STAGGER system, using statistical measures of necessity and
sufficiency to guide its search. It is highly configurable and can learn
low-complexity functions very efficiently, even in the presence of a large
number of irrelevant attributes.

Action Maps with State: The generate-and-test reinforcement learner was
extended by adding set-reset as an additional binary operator. This extension
allows simple action maps whose output depends on input values from
arbitrarily far back in history to be constructed. Although the method cannot
generate all possible sequential networks, it does represent a first effort at
learning action maps with state directly from reinforcement.

Delayed Reinforcement: Two existing approaches to learning from delayed
reinforcement were combined with the interval-estimation method to yield
robust algorithms. Watkins’ Q-learning method was extended to use the
techniques of the interval-estimation method to keep the algorithm from
converging prematurely to suboptimal solutions. In addition, Sutton’s AHC
method of learning to generate a local reinforcement signal was tested with
the IE algorithm as the local learning component.

Mobile  Robot  Experiments:  Many  of  the  algorithms  described  here  were  tested 
on  a  mobile  robot  in  a  moderately  complex  and  noisy  physical  environment. 
In  these  experiments,  the  algorithms  were  successfully  used  to  learn  control 
strategies  and  exhibited  considerable  robustness. 

11.2  Future  Work 

There is a long list of interesting variations and extensions that could be
made to the work described in this dissertation. Many of them are suggested at
the ends of the relevant chapters. As well as considering local improvements,
it is important to understand the global setting of this work.

Tabula rasa learning, as described in this dissertation, may not be a
sufficient method for creating intelligent embedded agents. However, the
methods of reinforcement learning may be used in concert with other knowledge
provided in different forms by a human programmer, in order to construct
agents that start with a useful base of knowledge and can improve upon it.
Knowledge might be provided by programmers in a number of different forms.

One of the simplest kinds of information that would improve the performance
of reinforcement-learning algorithms is the expected reinforcement of the
optimal policy. An agent that has this information can use it to make more
informed trade-offs between acting to gain information and acting to gain
reinforcement. The agent will be able to tell when it has found the best
policy and need not experiment further.

Russell [59] has introduced the idea of using determinations to bias learning.
Determinations are, essentially, descriptions of which input values the
outputs depend on. Such information would be of great help in constraining the
search done by the GTRL algorithm or in limiting the size of the set of
conjunctive terms in the k-DNF algorithms.

Finally,  we  might  start  from  a  complete  or  partial  program  specified  in  terms 
of  condition-action  rules.  An  interesting  research  direction  would  be  to  develop 
representations  of  programs  that  are  amenable  to  adjustment  using  reinforcement- 
learning  methods. 

Appendix  A

Statistics  in  GTRL 


This  appendix  describes  three  statistical  modules  that  can  be  used  with  the  GTRL 
algorithm.  They  can  be  applied  when  reinforcement  is  binomially  or  normally 
distributed,  as  well  as  in  cases  for  which  there  is  no  model.  Each  module  implements 
the  statistical  functions  described  in  Section  7.3. 

A.1  Binomial  Statistics

Each  hypothesis  has  the  following  set  of  statistics  associated  with  it: 

$b_0$  The number of times this hypothesis has agreed with the action 0 (not
necessarily chosen by it) and received reinforcement value 0 (mnemonically,
“bad 0”).

$b_1$  The number of times this hypothesis has agreed with the action 1 and
received reinforcement value 0.

$g_0$  The number of times this hypothesis has agreed with the action 0 and
received reinforcement value 1 (mnemonically, “good 0”).

$g_1$  The number of times this hypothesis has agreed with the action 1 and
received reinforcement value 1.

$pb_0$  The number of times this hypothesis has chosen the action 0 and
received reinforcement value 0 (mnemonically, “predicted bad 0”).

$pb_1$  The number of times this hypothesis has chosen the action 1 and
received reinforcement value 0.

$pg_0$  The number of times this hypothesis has chosen the action 0 and
received reinforcement value 1 (mnemonically, “predicted good 0”).

$pg_1$  The number of times this hypothesis has chosen the action 1 and
received reinforcement value 1.


The  procedure  for  updating  these  statistics  should  be  apparent  from  the  descriptions 
given  above. 

Given  this  data  structure,  we  can  define  the  statistical  functions  as  follows: 


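\[
\begin{aligned}
\text{age}(h) &= b_0 + b_1 + g_0 + g_1 \\
\text{er}(h) &= \frac{g_0 + g_1}{b_0 + b_1 + g_0 + g_1} \\
\text{er-ub}(h) &= \text{ub}(g_0 + g_1,\; b_0 + b_1 + g_0 + g_1) \\
\text{erp}(h) &= \frac{pg_0 + pg_1}{pb_0 + pb_1 + pg_0 + pg_1} \\
\text{erp-ub}(h) &= \text{ub}(pg_0 + pg_1,\; pb_0 + pb_1 + pg_0 + pg_1) \\
N(h) &= \frac{g_0}{g_0 + b_0} \\
S(h) &= \frac{g_1}{g_1 + b_1}
\end{aligned}
\]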

where the upper-bound function, ub(x, n), is defined as in [36]. The parameter
$z_{\alpha/2}$ is used to determine the size of the confidence interval for
computing ub.
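
The binomial module can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation used in the experiments: the class name `BinomialStats`, the `update` signature, and the treatment of empty counts are hypothetical, and the function `ub` is taken here to be the Wilson score upper bound, a standard binomial confidence bound that may or may not coincide with the definition cited from [36].

```python
import math
from dataclasses import dataclass, field


def ub(x, n, z=1.96):
    """Wilson score upper bound for x successes in n trials.

    ASSUMPTION: this standard bound stands in for the ub definition of [36].
    """
    if n == 0:
        return 1.0  # assumption: with no data, stay maximally optimistic
    p = x / n
    centre = p + z * z / (2 * n)
    spread = (z / math.sqrt(n)) * math.sqrt(p * (1 - p) + z * z / (4 * n))
    return (centre + spread) / (1 + z * z / n)


def _ratio(num, den):
    # assumption: a ratio with no supporting data is reported as 0.0
    return num / den if den else 0.0


@dataclass
class BinomialStats:
    bad: list = field(default_factory=lambda: [0, 0])    # b0, b1
    good: list = field(default_factory=lambda: [0, 0])   # g0, g1
    pbad: list = field(default_factory=lambda: [0, 0])   # pb0, pb1
    pgood: list = field(default_factory=lambda: [0, 0])  # pg0, pg1

    def update(self, action, reinforcement, chose):
        """Record one tick on which the hypothesis agreed with `action`.

        `chose` is True when this hypothesis itself generated the action.
        """
        (self.good if reinforcement else self.bad)[action] += 1
        if chose:
            (self.pgood if reinforcement else self.pbad)[action] += 1

    def age(self):
        return sum(self.bad) + sum(self.good)

    def er(self):
        return _ratio(sum(self.good), self.age())

    def er_ub(self, z=1.96):
        return ub(sum(self.good), self.age(), z)

    def erp(self):
        return _ratio(sum(self.pgood), sum(self.pbad) + sum(self.pgood))

    def erp_ub(self, z=1.96):
        return ub(sum(self.pgood), sum(self.pbad) + sum(self.pgood), z)

    def N(self):
        return _ratio(self.good[0], self.good[0] + self.bad[0])

    def S(self):
        return _ratio(self.good[1], self.good[1] + self.bad[1])
```

Storing the eight counters as four two-element lists indexed by action keeps the update to two increments per tick.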

A.2  Normal  Statistics

Each  hypothesis  has  the  following  set  of  statistics  associated  with  it: 

$n_0$  The number of times this hypothesis has agreed with the action 0 (not
necessarily chosen by it).

$n_1$  The number of times this hypothesis has agreed with the action 1.

$s_0$  The sum of the reinforcement values received when the hypothesis has
agreed with the action 0.

$s_1$  The sum of the reinforcement values received when the hypothesis has
agreed with the action 1.

$ss$  The sum of the squares of the reinforcement values received when the
hypothesis has agreed with the action taken.

$n_p$  The number of times this hypothesis has chosen an action.

$s_p$  The sum of reinforcement values received when the hypothesis has chosen
an action.

$ss_p$  The sum of the squares of the reinforcement values received when the
hypothesis has chosen an action.

The procedure for updating these statistics should be apparent from the
descriptions given above.

Given  this  data  structure,  we  can  define  the  statistical  functions  as  follows: 


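\[
\begin{aligned}
\text{age}(h) &= n_0 + n_1 \\
\text{er}(h) &= \frac{s_0 + s_1}{n_0 + n_1} \\
\text{er-ub}(h) &= \text{nub}(n_0 + n_1,\; s_0 + s_1,\; ss) \\
\text{erp}(h) &= \frac{s_p}{n_p} \\
\text{erp-ub}(h) &= \text{nub}(n_p,\; s_p,\; ss_p) \\
N(h) &= \frac{s_0}{n_0} \\
S(h) &= \frac{s_1}{n_1}
\end{aligned}
\]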


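where the normal upper-bound function, nub, is defined as

\[
\text{nub}\Bigl(n, \sum x, \sum x^2\Bigr) = \bar{y} + t^{(n-1)}_{\alpha/2} \, \frac{s}{\sqrt{n}} ,
\]

where $\bar{y} = \sum x / n$ is the sample mean,

\[
s = \sqrt{\frac{n \sum x^2 - \left(\sum x\right)^2}{n(n-1)}}
\]

is the sample standard deviation, and $t^{(n-1)}_{\alpha/2}$ is Student’s t
function with $n - 1$ degrees of freedom [69]. The parameter $z_{\alpha/2}$ is
used to determine the size of the confidence interval for computing nub.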
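
A minimal sketch of a Student-t upper confidence bound computed from the three summary statistics (count, sum, and sum of squares) is given below. It is illustrative only. The critical value of the t distribution is passed in by the caller as `t_crit` rather than computed, since the Python standard library has no t quantile function; that parameter, and the choice to return infinity when fewer than two samples are available, are assumptions of this sketch.

```python
import math


def nub(n, sum_x, sum_x2, t_crit):
    """Upper bound: sample mean + t_crit * s / sqrt(n).

    n       number of samples
    sum_x   sum of the samples
    sum_x2  sum of the squared samples
    t_crit  caller-supplied critical value of Student's t with n - 1
            degrees of freedom (hypothetical parameter of this sketch)
    """
    if n < 2:
        # assumption: no variance estimate yet, so the bound is unbounded
        return math.inf
    mean = sum_x / n
    variance = (n * sum_x2 - sum_x * sum_x) / (n * (n - 1))
    s = math.sqrt(max(variance, 0.0))  # guard against rounding below zero
    return mean + t_crit * s / math.sqrt(n)
```

For the samples 1, 2, 3, 4 the call is `nub(4, 10, 30, t_crit)`, with `t_crit` taken from a table for 3 degrees of freedom.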

A.3  Non-parametric  Statistics

This statistical module is parametrized by $w$, the window size, as well as by
the confidence-interval parameter $z_{\alpha/2}$. The parameter $w$ controls
the size of the data buffers kept by the module. Because this method employs
no summary statistics, all of the data for the last $w$ ticks are stored in
this module. Each hypothesis has the following set of statistics associated
with it:

$n$  The number of times this hypothesis has agreed with the action taken.

$r_t$  A list of the reinforcement values of the last $w$ ticks on which this
hypothesis agreed with the action taken, sorted increasing by time received.

$r_v$  A list of the reinforcement values of the last $w$ ticks on which this
hypothesis agreed with the action taken, sorted increasing by value.

$n_0$  The number of times this hypothesis has agreed with the action 0.

$r_{t0}$  A list of the reinforcement values of the last $w$ ticks on which
this hypothesis agreed with the action 0, sorted increasing by time received.

$r_{v0}$  A list of the reinforcement values of the last $w$ ticks on which
this hypothesis agreed with the action 0, sorted increasing by value.

$n_1$  The number of times this hypothesis has agreed with the action 1.

$r_{t1}$  A list of the reinforcement values of the last $w$ ticks on which
this hypothesis agreed with the action 1, sorted increasing by time received.

$r_{v1}$  A list of the reinforcement values of the last $w$ ticks on which
this hypothesis agreed with the action 1, sorted increasing by value.

$n_p$  The number of times this hypothesis has chosen the action.

$r_{tp}$  A list of the reinforcement values of the last $w$ ticks on which
this hypothesis chose the action, sorted increasing by time received.

$r_{vp}$  A list of the reinforcement values of the last $w$ ticks on which
this hypothesis chose the action, sorted increasing by value.

Updating these statistics is slightly more complex than in the previous cases.
The $n$’s are simply incremented appropriately. As long as the $n$ value is
less than or equal to $w$, new data are simply inserted into the appropriate
places in the lists. Once $n$ is greater than $w$, on each tick, the first
element of $r_t$ is removed from both $r_t$ and $r_v$, and the new
reinforcement value is inserted into the resulting $r_v$ and put on the end of
the resulting $r_t$. This keeps the window of data sliding along. We need
$r_t$ in order to know which element to remove from $r_v$ before we can add a
new element.
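
The sliding-window update just described can be sketched as follows for a single pair of lists. The class name `Window` and its attribute names are hypothetical; the time-ordered list is held in a `deque` and the value-ordered list is kept sorted with the standard `bisect` module.

```python
import bisect
from collections import deque


class Window:
    """One (count, time-ordered list, value-ordered list) triple."""

    def __init__(self, w):
        self.w = w          # window size
        self.n = 0          # total number of values ever recorded
        self.rt = deque()   # last w values, oldest first
        self.rv = []        # the same values, sorted increasing by value

    def add(self, r):
        self.n += 1
        if len(self.rt) == self.w:
            # Window is full: drop the oldest value from both lists.
            oldest = self.rt.popleft()
            del self.rv[bisect.bisect_left(self.rv, oldest)]
        self.rt.append(r)
        bisect.insort(self.rv, r)
```

The time-ordered list is needed only to identify which value leaves the window; all of the statistical functions read the value-ordered list.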

Given  this  data  structure,  we  can  define  the  statistical  functions,  using  the 
ordinary  sign  test  [26],  as  follows: 


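\[
\begin{aligned}
\text{age}(h) &= n \\
\text{er}(h) &= r_v[\lfloor \min(w, n)/2 \rfloor] \\
\text{er-ub}(h) &= r_v[\min(w, n) - u] \\
\text{erp}(h) &= r_{vp}[\lfloor \min(w, n_p)/2 \rfloor] \\
\text{erp-ub}(h) &= r_{vp}[\min(w, n_p) - u] \\
N(h) &= r_{v0}[\lfloor \min(w, n_0)/2 \rfloor] \\
S(h) &= r_{v1}[\lfloor \min(w, n_1)/2 \rfloor]
\end{aligned}
\]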

where the value $u$ is chosen to be the largest value such that

\[
\sum_{k=0}^{u} \binom{n}{k} (.5)^n \le \alpha/2 .
\]

For  large  values  of  n,  u  can  be  approximated  using  the  normal  distribution. 
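
As an illustration, $u$ can be computed exactly for small $n$ by accumulating binomial tail probabilities. The sketch below assumes that the condition is the usual sign-test tail bound, namely that the probability of at most $u$ successes in $n$ fair coin flips does not exceed $\alpha/2$; the function name `sign_test_u` and the convention of returning -1 when no value qualifies are hypothetical.

```python
from math import comb


def sign_test_u(n, alpha):
    """Largest u with sum_{k=0}^{u} C(n, k) * 0.5**n <= alpha / 2.

    Returns -1 when even u = 0 violates the bound (n too small).
    """
    total = 0   # running sum of binomial coefficients C(n, 0..k)
    u = -1
    for k in range(n + 1):
        total += comb(n, k)
        if total / 2 ** n > alpha / 2:
            break
        u = k
    return u
```

Working with integer binomial coefficients and dividing once per step avoids accumulating floating-point error in the tail sum.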

Appendix  B

Simplifying  Boolean  Expressions 
in  GTRL 


This appendix describes the Boolean canonicalization and simplification rules
that are used in the GTRL algorithm. It is assumed that simplification happens
when a conjunction, disjunction, or set-reset expression is being constructed
and that the arguments have already been simplified and canonicalized. The
algorithm is described as first constructing the combined hypothesis, then
testing to see if it has depth appropriate to the level of the algorithm for
which it was constructed. In fact, the procedures for constructing composite
hypotheses simply return nil if any applicable simplification rules can be
found.

The disjunctive hypothesis $e_1 \lor e_2$ can be simplified to a lower level
of complexity if any of the following statements is true ($e$ stands for any
expression):


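\[
\begin{aligned}
e_1 &= e_2 \\
e_1 &= \text{false} \\
e_1 &= \text{true} \\
e_2 &= \text{false} \\
e_2 &= \text{true} \\
e_1 &= \lnot e_2 \\
e_2 &= \lnot e_1 \\
e_1 &= e_2 \lor e \\
e_1 &= e \lor e_2 \\
e_2 &= e_1 \lor e \\
e_2 &= e \lor e_1 \\
e_1 &= e_2 \land e \\
e_1 &= e \land e_2 \\
e_2 &= e_1 \land e \\
e_2 &= e \land e_1
\end{aligned}
\]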

The conjunctive hypothesis $e_1 \land e_2$ can also be simplified in any of
the situations described above. The set-reset hypothesis $SR(e_1, e_2)$ can be
simplified in all of the situations described above, except the ones in which
$e_1 = e_2 \land e$ or $e_1 = e \land e_2$. To see this, note that
$SR(a, a \land b) = SR(a, b)$ because setting takes priority, but
$SR(a \land b, a)$ cannot be reduced.
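
The test for whether a proposed composite hypothesis is simplifiable can be sketched as below. The tuple encoding of expressions, the operator names, and the function names are all hypothetical choices made for illustration: input bits are written `("in", i)`, the constants are the strings `"true"` and `"false"`, negation is `("not", e)`, and binary expressions are `(op, e1, e2)` with `op` one of `"or"`, `"and"`, or `"sr"`.

```python
CONSTANTS = ("true", "false")


def has_operand(e, op, arg):
    """True if e is a binary `op` expression with `arg` as one operand."""
    return isinstance(e, tuple) and e[0] == op and arg in e[1:]


def simplifiable(op, e1, e2):
    """True if combining e1 and e2 under `op` matches a simplification rule.

    A constructor would return nil (None) in exactly these cases.
    """
    if e1 == e2 or e1 in CONSTANTS or e2 in CONSTANTS:
        return True
    if e1 == ("not", e2) or e2 == ("not", e1):
        return True
    if has_operand(e1, "or", e2) or has_operand(e2, "or", e1):
        return True
    if has_operand(e2, "and", e1):
        return True
    if has_operand(e1, "and", e2):
        # The one exception: SR(a AND b, a) cannot be reduced.
        return op != "sr"
    return False
```

For example, with `a = ("in", 0)` and `b = ("in", 1)`, the call `simplifiable("sr", a, ("and", a, b))` reports a match, while `simplifiable("sr", ("and", a, b), a)` does not.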

Canonicalization consists of ordering the two top-level subexpressions,
because they are assumed to have already been canonicalized. An arbitrary
ordering is defined on operators; atomic expressions referring to input bits
are ordered according to their index into the input vector. The expression
$e_1$ is less than expression $e_2$ if and only if

•  $e_1$ and $e_2$ are both atoms and $e_1 < e_2$;

•  $e_1$ is an atom and $e_2$ is not;

•  neither $e_1$ nor $e_2$ is an atom and the top-level operator of $e_1$ is
less than the top-level operator of $e_2$;

•  neither $e_1$ nor $e_2$ is an atom, they both have the same top-level
operator, and the first subexpression of $e_1$ is less than (under this
definition) the first subexpression of $e_2$; or

•  neither $e_1$ nor $e_2$ is an atom, they both have the same top-level
operator, they both have the same first subexpression, and the second
subexpression of $e_1$ is less than (under this definition) the second
subexpression of $e_2$.
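
The ordering above translates directly into a recursive comparison. The following sketch assumes a hypothetical tuple encoding in which an atom is `("in", i)` and a composite expression is `(op, first, second)`; the particular operator ranking in `OP_RANK` is an arbitrary choice, as the text allows, and negation is left out for brevity. Reordering is applied here only to the commutative operators, which is an assumption of this sketch, since swapping the arguments of a set-reset expression would change its meaning.

```python
OP_RANK = {"and": 0, "or": 1, "sr": 2}  # arbitrary ordering on operators


def is_atom(e):
    return e[0] == "in"


def less(e1, e2):
    """True if e1 precedes e2 in the canonical ordering."""
    if is_atom(e1) and is_atom(e2):
        return e1[1] < e2[1]            # compare input-vector indices
    if is_atom(e1):
        return True                     # atoms precede composites
    if is_atom(e2):
        return False
    if e1[0] != e2[0]:
        return OP_RANK[e1[0]] < OP_RANK[e2[0]]
    if e1[1] != e2[1]:
        return less(e1[1], e2[1])       # compare first subexpressions
    return less(e1[2], e2[2])           # then second subexpressions


def canonical(op, e1, e2):
    """Build (op, e1, e2) with operands in canonical order."""
    if op in ("and", "or") and less(e2, e1):
        return (op, e2, e1)
    return (op, e1, e2)
```

With `a = ("in", 0)` and `b = ("in", 2)`, both `canonical("or", a, b)` and `canonical("or", b, a)` yield `("or", a, b)`, so syntactically different constructions of the same disjunction are recognized as one hypothesis.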